Go's Memory Allocator - Overview

I’ve decided to learn how the Go runtime allocates memory and this is the first in a series of posts that are going to cover what I learned (and also the process itself: how I learned). These posts are based of my notes for the talk I gave at Gophercon UK in 2018:

Slides are available here.

In this first post we are going to cover:

  1. Memory allocation 101
  2. How the allocator is invoked
  3. An overview of the tcmalloc algorithm
  4. An overview of the Go’s runtime allocator

The objective of this post is to give a broad idea of how allocation works on the runtime, without actually showing any of it’s code. The following posts on the series will be more dense and with lots of code samples extracted from the runtime. Each of the future posts will cover an specific detail of the allocator.

Disclaimer: The code samples displayed in this series of posts were redacted to simplify the reading. The full code is available on golang/go repository. This post is based of the 1.10 branch. In the last post of the series, we are going to cover the main changes in go 1.11.

Disclaimer 2: I’m not a go runtime expert. Just someone curious and interested in those kinds of things. Feel free to reach me out and correct me.

Disclaimer 3: Most of the content in these posts are implementation details and are not defined on the Go Specification. There are no guarantees that they are going to stay the same, approach all this with caution. For instance, the Go Spec says nothing about heap or stack.

0. Memory allocation 101

Before reading this post, I recommend reading my previous post on memory allocation. That post covers the basics needed to understand what we are going to cover in this series.

1. How the allocator gets invoked

In languages like C, it’s pretty obvious when the memory allocator is invoked: that happens when you can malloc (or one of its variations) and free.

We know that Go has no malloc function and also you are not able to actively free memory allocations because the language is garbage collected. So, one may wonder how the memory allocation code is invoked in our Go programs.

Imagine we have the following Go code:

package main

func main() {
	f()
}

//go:noinline
func f() *int {
	i := 10
	return &i
}

The f() function is marked //go:noinline to prevent the compiler from inlining the call to f(), removing the function call and making the example a little less trivial to understand.

The compiler uses a technique called escape analysis to choose wether a variable is going to be placed on the heap or the stack. We are interested in heap allocations and that is exactly where i is allocated. We can see that by running go build passing -gcflags "-m -m":

The -gcflags flag is way to pass flags to the go compiler during the build and the "-m" flag tells the compiler to output optmization decisions (passing it two times increases verbosity). One can also use "-m=2".

$ go build -gcflags "-m -m" main.go
# command-line-arguments
./main.go:8:6: cannot inline f: marked go:noinline
./main.go:3:6: cannot inline main: non-leaf function
./main.go:10:9: &i escapes to heap
./main.go:10:9: 	from ~r0 (return) at ./main.go:10:2
./main.go:9:2: moved to heap: i

Imagine if i was allocated on the stack. That means that &i would be the address of a memory area on f()’s stack frame. When f() returns, it’s stack frame becomes invalid memory and thus main() can’t access any of it. Since f() is using pointer semantics to return the address i the compiler figures out that i should live on the heap, since it must be allocated on a memory area that remains valid even after f() returns.

I recommend reading Language Mechanics On Escape Analysis to know more about escape analysis.

So, we know that i is going to be allocated on the heap. But how does the runtime set that up? With the compiler’s help! We can get an idea from reading the generated assembly.

The following output was redacted for brevity, showing only the relevant parts. Some resources on Go assembly are linked on the end of the post.

$ go tool compile -S main.go
...
0x001d 00029 (main.go:9)	LEAQ	type.int(SB), AX
0x0024 00036 (main.go:9)	MOVQ	AX, (SP)
0x0028 00040 (main.go:9)	PCDATA	$0, $0
0x0028 00040 (main.go:9)	CALL	runtime.newobject(SB)
...

This shows a call to runtime.newobject passing the address of the type int.

func newobject(typ *_type) unsafe.Pointer {
	return mallocgc(typ.size, typ, true)
}

mallogc is the entrypoint to the runtime memory allocation algorithm. It receives the size, type and wether or not this memory should return zeroed (true in this case).

There are other places mallocgc gets called (e.g runtime.makeslice etc). We are going to cover mallocgc in details in the next posts, all we need to know is that it does the heap allocation work and is a implementation of the algorithm we are going to cover at a high level now.

2. Thread Cache Malloc (tcmalloc)

As stated on some comments at the code:

The Go runtime allocator was initially based on tcmalloc but has diverged quite a bit.

The page tcmalloc page, also linked in the code comments, has a great overview of how the algorithm works.

I’ll briefly summarize tcmalloc in this section. I recommend reading the refered page as a way to get into more details and also checking some of the benchmarks comparing tcmalloc with glibc malloc.

We can divide allocations into 2 buckets: small and large allocations. Most of the allocations are served from local thread caches, meaning that no locks are required, reducing contention and increasing the overall performance.

2.1 Small allocations

Small allocations are allocations with size <= 32kB.

  1. The requested sized is rounded up to one of the 170 size classes; e.g, allocations in the range 961 to 1024 bytes are rounded up to 1024.
  2. tcmalloc looks up at the local thread cache to see if there are free objects available for this size class:
    1. If there are free objects, it returns the first one from the list and we are all done. This allocation required no locks as it was totally served from the local thread cache.
    2. If there are no free objects, fetch new objects from the central cache.
  3. The central cache checks if it has free objects:
    1. If there are free objects, return a bunch to the local cache;
    2. If there are no free objects, request a chunck of memory pages from the central heap, split it into objects for this size class and return them to the local cache.

2.2 Large allocations

Large allocations are served directly by the central heap. They are rounded up to a page size 4K and a run of contiguous pages are called Spans.

The central heap maintains a array of free lists. For k < 256, the kth entry is a free list of Spans that consist of k pages. The 256th entry is a free list of Spans that have length >= 256 pages.

  1. the central heap will search for a Span on the k-th free list
    1. If one is found, return it to the user
    2. If one is not found, look for it on the k+1-th free list, and so on…
  2. If a free Span is not found in any of the free lists, allocate a large chunk of memory from the OS
  3. If the Span found, or the memory requested from the OS, is larger than the user requested, split it into multiple spans and place then on the free lists before returning to the user.

3. Overview of the Go’s Runtime Allocator

3.1 Allocations

We can divide Go heap allocations into 3 groups (as opposed to the 2 groups in tcmalloc). Each of these types are going to be covered in separated posts on this series.

  • Small allocations (< 32 kB)

Small allocations in Go are pretty similar to tcmalloc. It follows from the most local to the most central structure:

mcache -> mcentral -> mheap

  • Tiny Allocations (< 16 B and no pointers)

Follows the same path as small allocations but is stored, managed and freed as 16 B memory blocks by the mcache.

  • Large Allocations (> 32 kB)

These are served directly by the mheap, first by checking its free lists and asking for more memory from the OS when needed.

3.2 Data structures

In this section we are going to get a high level view of how each of the mentioned data structures operate.

The following diagram displays the main components that are going to be described later:

      +------------+ +------------------------+
+---+ |            | |        mheap           |
| P | |   mcache   | |                        |
+---+ |            | |  +----------+          |
      +------------+ |  | mcentral |          |
                     |  +----------+          |
                     |  +----------+          |
                     |  | mcentral |          |
                     |  +----------+          |
                     |       .                |
                     |       .                |
                     |       .                |
      +------------+ |  +----------+          |
+---+ |            | |  | mcentral |          |
| P | |   mcache   | |  +----------+          |
+---+ |            | |                        |
      +------------+ +------------------------+

We can see that each logical processor (P) has its own mcache, which is the local thread cache. There is a single mheap, which plays the part of the central heap. Surprisinly, there are multiple mcentral objects. There is one for each of the size classes. This means that even when requesting for a span from the mcentral we can reduce contention by only acquiring a lock for the requested size class. Goroutines in other processors that also require allocating from the mcentral but are requesting objects for a different size class won’t find that mcentral locked.

Thread Cache (mcache)

The thread cache is implemented by the mcache structure. There is one mcache for each P (logical processor).

        +--------------------+                          
        |       mcache       |                          
        |                    |                          
        |          +-------+ |                          
        |     8 B  | mspan | |                          
        |          +-------+ |                          
        |          +-------+ |                          
        |    16 B  | mspan | |                          
        |          +-------+ |                          
        |              .     |                          
        |              .     |                          
        |              .     |                          
        |          +-------+ |                          
        |   32 kB  | mspan | |                          
        |          +-------+ |                          
        |                    |                          
        |   +-------------+  |                          
        |   | tiny block  |  |                          
        |   +-------------+  |                          
        +--------------------+           

The mcache maintains a mspan of each class (for objects with 8 bytes up to 32 kB). Each of these spans are able to allocate multiple objects of that size. If the span for a given class is full, the mcache requests a new one from the corresponding mcentral structure for that size class.

Since it is a per-P structure, there is no need to hold locks when allocating from the mcache which is a interesting optimization.

The mcache also maintains a block that is used for tiny allocations (this is a concept that is not on tcmalloc) of objects that contain no pointers. When are going to cover how this works on a post about Tiny Allocation in go.

Central free list (mcentral)

The central free list is implemented by mcentral. There is one mcentral for each span class sizes and they are stored inside the mheap struct, which we will cover soon.

+-------------------------+                        
|        mcentral         |                        
|                         |                        
|            +----------+ |    +-----+      +-----+
|     empty  |mSpanList | |--->|mspan|<---->|mspan|
|            +----------+ |    +-----+      +-----+
|                         |                        
|            +----------+ |    +-----+      +-----+
|   nonempty |mSpanList | |--->|mspan|<---->|mspan|
|            +----------+ |    +-----+      +-----+
|       +-------+         |                        
|       | lock  |         |                        
|       +-------+         |                        
+-------------------------+                         

Each mcentral holds a lock that needs to be acquired when allocating from it, since mcentral is shared by all Ps.

It maintains two double linked lists (of type mSpanList), one for spans that contain free objects and one for spans that don’t contain any free object or that are already in a mcache.

When a mcache requests a new span, the mcentral traverses both lists looking for spans that are available and can be handled to the mcache.

If there is no span available, the mcentral requests for a new span from the central heap (mheap) and returns this new span to the mcache.

Central Heap (mheap)

The central heap is implemented by the type mheap struct{}. Large objects (> 32 kB) are directly allocated from it. Since this structure is shared by all threads, allocating from it incurs in locking. This is the largest structure on the allocator so we are going to break it down.

+----------------------------------------------------+  
|                        mheap                       |  
|                                                    |  
|+-----------------------+ +-----------------------+ |  
||         free          | |         busy          | |  
||           +---------+ | |           +---------+ | |  
||   1 page  |mSpanList| | |   1 page  |mSpanList| | |  
||           +---------+ | |           +---------+ | |  
||          .            | |          .            | |  
||          .            | |          .            | |  
||          .            | |          .            | |  
||           +---------+ | |           +---------+ | |  
|| 127 pages |mSpanList| | | 127 pages |mSpanList| | |  
||           +---------+ | |           +---------+ | |  
||                       | |                       | |  
||           +---------+ | |           +---------+ | |  
||128+ pages | mTreap  | | |128+ pages |mSpanList| | |  
||           +---------+ | |           +---------+ | |  
|+-----------------------+ +-----------------------+ |  
|                     +---------+                    |  
|                     |  lock   |                    |  
|                     +---------+                    |  
+----------------------------------------------------+  

The mheap, as on tcmalloc, maintains an array of free linked lists. The i-th item in the array contains runs of pages that consist of i pages.

The mheap also maintains a mtreap with runs of pages that are larger then 127 pages. This used to be a linked list, but changed in this commit due to performance on large heaps.

A treap is a probabilistically balanced binary tree that offers O(logN) behavior for inserting and removing spans. The tree is kept sorted both by span size and the span address is used as second sorting key.

When asked for n pages of memory, the mheap traverses its free linked lists for spans greater or equal to n. In case there are no free spans, the mheap asks the OS for more. In linux, it does so by using the mmap(2) syscall to ask for at least 1MB. Contiguous free spans are subject to coalescing and are placed on the corresponding free lists.

The mheap maintains a lookup table of virtual page address IDs to span. A large enough area to map every page that can be allocated is reserved upon initialization. This lookup table can be used to find the span responsible for a given page and is the “central array” described in tcmalloc’s page.

If the span is allocated, all pages map to the span itself.

       +--------|--------|--------|--------+
 spans | page 1 | page 2 | page 3 | page 4 |
       +=--|----|------|-|--|-----|----|---+
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |           |    |          |    
           |         +-v----v---+      |    
           +--------->  mspan   <------+    
                     +----------+           

If the span is not allocated, it’s first and last page points to the span. The rest points to an arbitrary span.

       +--------|--------|--------|--------+
 spans | page 1 | page 2 | page 3 | page 4 |
       +---|----|--------|--------|----|---+
           |                           |    
           |                           |    
           |                           |    
           |                           |    
           |                           |    
           |                           |    
           |         +----------+      |    
           +--------->  mspan   <------+    
                     +----------+           

Pages that have never been allocated are nil entries in this lookup table.

The mheap also maintains:

  1. the arena, which is the range of addresses used for the Go heap. This is used by the mheap to know when to request more memory from the OS and at what virtual address this new mapping needs to start;
  2. A bitmap used by the garbage collector to know what is still live and can’t be collected;
  3. A bunch of fixed sized allocators that are used to manually allocate the structures used internally by the allocator.

Spans (mspan)

A span is of type mspan and represents a run of contiguous pages that are managed by the mheap. This is the object travels between the mheap and the caches (mcentral and mcache).

Every span has a startAddr which, as the name says, is the address of its first byte and a npages which is the number of pages that this span has.

For small allocations, which means that the span is of one of the given size classes, we can imagine the span as a container for nelems of that given size. The mcache needs to check if the span has still room for another object of that size.

For large allocations (> 32kB), the span holds a single object and a pointer to its base address (startAddr) is returned by mallocgc.

Wrapping Up

I think this is enough for a first, introductory post. On the next post we are going to start exploring the runtime code and see how it handles large object allocations.

References

  1. Go internals, Chapter 1: Go assembly
  2. Language Mechanics On Escape Analysis
  3. tcmalloc

Using cgroups to limit I/O

Developers running their apps on tsuru can choose plans based on memory and cpu usage. We were looking at adding I/O usage as one of the plan’s features, so one could choose how many I/O operations per second or bytes per second a container may be able to read/write.

Being able to limit I/O is particulary important when running a diverse cloud, where applications with different workloads and needs are running together sharing multiple resources. We need to make sure that an application that starts to behave badly does not interfere with others.

Our main scheduler is a docker based and docker exposes some flags to limit a container IOPS (I/O operations per second) and BPS (bytes per second). Docker relies on a linux kernel feature, called cgroups, to be able to limit a process resource usage.

Before exposing this as a possible parameter on tsuru plans, I decided to investigate and do some experimentation using cgroups to limit a process I/O (IOPS or BPS). I came to a conclusion that currently, it is not possible to fulfill our needs and decided to delay the implementation.

In the next section, we are going to discuss cgroups, the main kernel feature used to limit resource usage.

Introduction to cgroups

Cgroups are a mechanism available in Linux to aggregate/partition a set of tasks and all their future children. Different subsystems may hook into cgroups to provide different behaviors, such as resource accounting/limiting (this particular kind of subsystem is called controller).

Cgroups (along with namespaces) are one building blocks of containers, but you don’t need a container runtime to make use of them.

Managing cgroups is done by interacting with the cgroup filesystem, by creating directories and writing to certain files. There are two versions of cgroups available in newest kernels: v1 and v2. Cgroups v2 completely changes the interface between userspace and the kernel and, as of today, container runtimes only support cgroups v1, so we will focus on v1 first.

Cgroups v1

Cgroups v1 has a per-resource (memory, blkio etc) hierarchy, where each resource hierarchy contains cgroups for that resource. They all live in /sys/fs/cgroup:

/sys/fs/cgroup/
              resourceA/
                        cgroup1/
                        cgroup2/
              resourceB/
                        cgroup3/
                        cgroup4/

Each PID is in exactly one cgroup per resource. If a PID is not assigned to a specific cgroup for a resource, it is in the root cgroup for that particular resource. Important: Even if a cgroup has the same name in resourceA and resourceB they are considered distinct. Some of the available resources are:

  • blkio: Block IO controller - limit I/O usage
  • cpuacct: CPU accounting controller - accouting for CPU usage
  • cpuset: CPU set controller - allows assigning a set of CPUs and memory nodes to a set of tasks
  • memory: Memory resource controller - limit memory usage
  • hugeTLB: Huge TLB controller - allows limiting the usage of huge pages
  • devices: Device whitelist controller - enforce open and mknode restrictions on device files
  • pids: Process number controller - limit the number of tasks

In the next section, we are going to investigate the use of the blkio controller to limit the I/O bytes per second of a task running under a cgroup.

Limiting I/O with cgroups v1

To limit I/O we need to create a cgroup in the blkio controller.

$ mkdir -p /sys/fs/cgroup/blkio/g1

We are going to set our limit using blkio.throttle.write_bps_device file. This requires us to specify limits by device, so we must find out our device major and minor version:

$ cat /proc/partitions
major minor  #blocks  name

   8        0   10485760 sda
   8        1   10484719 sda1
   8       16      10240 sdb

Let’s limit the write bytes per second to 1048576 (1MB/s) on the sda device (8:0):

$ echo "8:0 1048576" > /sys/fs/cgroup/blkio/g1/blkio.throttle.write_bps_device

Let’s place our shell into the cgroup, by writing its PID to the cgroup.procs file, so every command we start will run in this cgroup:

$ echo $$ > /sys/fs/cgroup/blkio/g1/cgroup.procs

Let’s run dd to generate some I/O workload while watching the I/O workload using iostat:

$ dd if=/dev/zero of=/tmp/file1 bs=512M count=1
1+0 records in
1+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 1.25273 s, 429 MB/s

At the same time, we get this output from iostat (redacted):

$ iostat 1 -d -h -y -k -p sda
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                609.00         4.00    382400.00          4     382400
sda1
                609.00         4.00    382400.00          4     382400

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                260.00       212.00    133696.00        212     133696
sda1
                260.00       212.00    133696.00        212     133696

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  0.99         0.00       859.41          0        868
sda1
                  0.99         0.00       859.41          0        868
...

We were able to write 429 MB/s; we can see from iostat output that all 512MB were writen in the same second. If we try the same command but opening the file with O_DIRECT flag (passing oflag=direct to dd):

$ dd if=/dev/zero of=/tmp/file1 bs=512M count=1 oflag=direct
1+0 records in
1+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 539.509 s, 995 kB/s

At the same time, we get this output from iostat (redacted):

$ iostat 1 -d -h -y -k -p sda
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  1.00         0.00      1024.00          0       1024
sda1
                  1.00         0.00      1024.00          0       1024

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  1.00         0.00      1024.00          0       1024
sda1
                  1.00         0.00      1024.00          0       1024

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  1.00         0.00      1024.00          0       1024
sda1
                  1.00         0.00      1024.00          0       1024
...

Et voila! Our writes were below 1.0 MB/s. Why didn’t it work on the first try? Lets try to understand on the next section.

The I/O Path

So, whats the difference between opening the file with O_DIRECT and opening the file with no flags? The article Ensuring Data Reaches the Disk does a pretty good job explaining all the I/O flavors in Linux and their different paths on the kernel.

data path in I/O

That data starts out as one or more blocks of memory, or buffers, in the application itself. Those buffers can also be handed to a library, which may perform its own buffering. Regardless of whether data is buffered in application buffers or by a library, the data lives in the application’s address space. The next layer that the data goes through is the kernel, which keeps its own version of a write-back cache called the page cache. Dirty pages can live in the page cache for an indeterminate amount of time, depending on overall system load and I/O patterns. When dirty data is finally evicted from the kernel’s page cache, it is written to a storage device (such as a hard disk). The storage device may further buffer the data in a volatile write-back cache. If power is lost while data is in this cache, the data will be lost. Finally, at the very bottom of the stack is the non-volatile storage. When the data hits this layer, it is considered to be “safe.”

Basically, when we write to a file (opened without any special flags), the data travels across a bunch of buffers and caches before it is effectively writen to the disk.

Opening a file with O_DIRECT (available since Linux 2.4.10), means (from man pages):

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers.

So, for some reason, when we bypassed the kernel’s page cache, the cgroup was able to enforce the I/O limit specified.

This commit in Linux adds some documentation that explains exactly what is happening.

On traditional cgroup hierarchies, relationships between different controllers cannot be established making it impossible for writeback to operate accounting for cgroup resource restrictions and all writeback IOs are attributed to the root cgroup.

It’s important to notice that this was added when cgroups v2 were already a reality (but still experimental). So the “traditional cgroup hierarchies” means cgroups v1. Since in cgroups v1, different resources/controllers (memory, blkio) live in different hierarchies on the filesystem, even when those cgroups have the same name, they are completely independent. So, when the memory page is finally being flushed to disk, there is no way that the memory controller can know what blkio cgroup wrote that page. That means it is going to use the root cgroup for the blkio controller.

Right below that statement, we find:

If both the blkio and memory controllers are used on the v2 hierarchy and the filesystem supports cgroup writeback writeback operations correctly follow the resource restrictions imposed by both memory and blkio controllers.

So, in order to limit I/O when this I/O may hit the writeback kernel cache, we need to use both memory and io controllers in the cgroups v2!

Cgroups v2

Since kernel 4.5, the cgroups v2 implementation was marked non-experimental.

In cgroups v2 there is only a single hierarchy, instead of one hierarchy for resource. Supposing the cgroups v2 file system is mounted in `/sys/fs/cgroup/:

/sys/fs/cgroup/
              cgroup1/
                cgroup3/
              cgroup2/        
                cgroup4/

This hierarchy means that we may impose limits to both I/O and memory by writing to files in the cgroup1 cgroup. Also, those limits may be inherited by cgroup3. Not every controller supported in cgroups v1 is available in cgroups v2. Currently, one may use: memory, io, rdma and pids controller.

Let’s try the same experiment as before using cgroups v2 this time. The following example uses Ubuntu 17.04 (4.10.0-35-generic).

Disabling cgroups v1

First of all, we need to disable cgroups v1. To do that, I’ve added GRUB_CMDLINE_LINUX_DEFAULT="cgroup_no_v1=all" to /etc/default/grub and rebooted. That kernel config flag disables cgroup v1 for all controllers (blkio, memory, cpu and so on). This guarantees that both the io and memory controllers will be used on the v2 hierarchy (one of the requirements mentioned by the doc on Writeback mentioned earlier).

Limiting I/O

First, let’s mount the cgroups v2 filesystem in /cgroup2:

$ mount -t cgroup2 nodev /cgroup2

Now, create a new cgroup, called cg2 by creating a directory under the mounted fs:

$ mkdir /cgroup2/cg2

To be able to edit the I/O limits using the the I/O controller on the newly created cgroup, we need to write “+io” to the cgroup.subtree_control file in the parent (in this case, root) cgroup:

$ echo "+io" > /cgroup2/cgroup.subtree_control

Checking the cgroup.controllers file for the cg2 cgroup, we see that the io controller is enabled:

$ cat /cgroup2/cg2/cgroup.controllers
io

To limit I/O to 1MB/s, as done previously, we write into the io.max file:

$ echo "8:0 wbps=1048576" > io.max

Let’s add our bash session to the cg2 cgroup, by writing its PID to cgroup.procs:

$ echo $$ > /cgroup2/cg2/cgroup.procs

Now, lets use dd to generate some I/O workload and watch with iostat:

dd if=/dev/zero of=/tmp/file1 bs=512M count=1
1+0 records in
1+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 468.137 s, 1.1 MB/s

At the same time, we get this output from iostat (redacted):

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  1.02         0.00       693.88          0        680
sda1
                  1.02         0.00       693.88          0        680

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  2.00         0.00       732.00          0        732
sda1
                  2.00         0.00       732.00          0        732

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  0.99         0.00       669.31          0        676
sda1
                  0.99         0.00       669.31          0        676

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  1.00         0.00       672.00          0        672
sda1
                  1.00         0.00       672.00          0        672
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda
                  1.00         0.00      1024.00          0       1024
sda1
                  1.00         0.00      1024.00          0       1024
...

So, even relying on the writeback kernel cache we are able to limit the I/O on the disk.

Wrapping Up

We’ve seen how limiting I/O is hard using cgroups. After some experimentation and research we found out that using cgroups v2 is the only way to properly limit I/O (if you can’t change your application to do direct or sync I/O).

As a result of this experiment, we decided to not limit I/O using docker at the moment since it does not support cgroups v2 (yet).

Linux Delay Accounting

Ever wondered how long is your program spending while waiting for I/O to finish? Or if it is spending lots of time while waiting for a turn to run on one of the cpus? Linux provides delay accounting information that may help answering these and other questions. Delay information is available for many types of resources:

  1. waiting for a CPU (while being runnable)
  2. completion of synchronous block I/O initiated by the task
  3. swapping in pages
  4. memory reclaim

These information is available in nanoseconds, on a per pid/tid basis, and is pretty useful to find out if your system resources are saturated by the number of concurrent tasks running on the machine. You can either: reduce the amount of work being done on the machine by removing unecessary processes or adjust the priority (cpu priority, io priority and rss limit) for important tasks.

Acessing delay accounting information

This information is available for userspace programs thru the Netlink interface, an interface a user-space program in linux uses to communicate with the kernel. It can be used by a bunch of stuff: managing network interfaces, setting ip addresses and routes and so on.

Linux ships with a source code example, getdelays, on how to build tools to consume such information [2]. By using ./getdelays -d -p <PID> we can visualize the delay experienced by process while consuming different kinds of resources.


Side note: since this commit, Linux requires a process to run as root to be able to fetch delay accounting information. I plan to check up if these could be changed so an user may check delay information on any process owned by him/her.


getdelays states that “It is recommended that commercial grade applications use libnl or libnetlink and use the interfaces provided by the library”, so I decided to rewrite part of getdelays using a higher level library, instead of having to handle parsing and other instrinsics of the netlink protocol.

Re-implementing getdelays using libnl

I found libnl to be a quite flexible library and was able to write this example in a couple of hours (and I didn’t have any prior experience with netlink). Their documentation on the Netlink protocol had everything I needed to understand the protocol.

The source code for my implementation is available on my github and uses libnl to “talk” netlink. In the following sections I`ll highlight the most important parts of the implementation.

1. Setup

sk = nl_socket_alloc();
if (sk == NULL) {
    fprintf(stderr, "Error allocating netlink socket");
    exit_code = 1;
    goto teardown;
}

if ((err = nl_connect(sk, NETLINK_GENERIC)) < 0) {
    fprintf(stderr, "Error connecting: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardown;
}

if ((family = genl_ctrl_resolve(sk, TASKSTATS_GENL_NAME)) == 0) {
    fprintf(stderr, "Error retrieving family id: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardown;
}

The setup is pretty straightforward:

  1. we start by calling nl_socket_alloc() to allocate a netlink socket, required for the communication with the netlink interface
  2. the call to nl_connect connects our socket to the NETLINK_GENERIC protocol (depending on our needs, we can use other protocols like NETLINK_ROUTE for routing operations)
  3. gen_ctrl_resolve is used to obtain the family id of the taskstats. This is the “postal code” of the delay information holder

After the setup we are ready to prepare our netlink message.

2. Preparing our message

if ((err = nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, callback_message, NULL)) < 0) {
        fprintf(stderr, "Error setting socket cb: %s\n", nl_geterror(err));
      exit_code = 1;
      goto teardown;
}

if (!(msg = nlmsg_alloc())) {
    fprintf(stderr, "Failed to alloc message: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardown;
}

if (!(hdr = genlmsg_put(msg, NL_AUTO_PID, NL_AUTO_SEQ, family, 0,
    NLM_F_REQUEST, TASKSTATS_CMD_GET, TASKSTATS_VERSION))) {
    fprintf(stderr, "Error setting message header\n");
    exit_code = 1;
    goto teardownMsg;
}

if ((err = nla_put_u32(msg, TASKSTATS_CMD_ATTR_PID, pid)) < 0) {
    fprintf(stderr, "Error setting attribute: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}
  1. Libnl offers a bunch of callback hooks that can be used to handle different kinds of events. Using nl_socket_modify_cb we register a custom callback (NL_CB_CUSTOM) callback_message that will be called for all valid messages received from the kernel (NL_CB_VALID)
  2. nlmsg_alloc allocs a struct to hold the message that will be sent
  3. genlmsg_put sets the messsage header: NL_AUTO_PID and NL_AUTO_SEQ tells libnl to fill in the message sequence and pid number, required by the protocol; family is the taskstats family id; NLM_F_REQUEST indicates that this message is a request; TASKSTATS_CMD_GET is the command that we are sending to the taskstats interface, meaning that we want to get some information and TASKSTATS_VERSION is used by the kernel to be able to handle different versions of this interface
  4. nla_put_u32 sets an attribute TASKSTATS_CMD_ATTR_PID, which indicates that we are asking for the taskstats information of a particular pid, provided as the header value

3. Sending the message

if ((err = nl_send_sync(sk, msg)) < 0) {
    fprintf(stderr, "Error sending message: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}

if ((err = nl_recvmsgs_default(sk)) < 0) {
    fprintf(stderr, "Error receiving message: %s\n", nl_geterror(err));
    exit_code = 1;
    goto teardownMsg;
}
  1. nl_send_sync sends a message using the socket and waits for an ack or an error message
  2. nl_recvmsgs_default waits for a message; this will block until the message is parsed by our callback

4. Receiving the response

Handling of the response is done by the callback_message function:

int callback_message(struct nl_msg *nlmsg, void *arg) {
    struct nlmsghdr *nlhdr;
    struct nlattr *nlattrs[TASKSTATS_TYPE_MAX + 1];
    struct nlattr *nlattr;
    struct taskstats *stats;
    int rem, answer;

    nlhdr = nlmsg_hdr(nlmsg);

    if ((answer = genlmsg_parse(nlhdr, 0, nlattrs, TASKSTATS_TYPE_MAX, NULL)) < 0) {
        fprintf(stderr, "error parsing msg\n");
        return -1;
    }

    if ((nlattr = nlattrs[TASKSTATS_TYPE_AGGR_PID]) || (nlattr = nlattrs[TASKSTATS_TYPE_NULL])) {
        stats = nla_data(nla_next(nla_data(nlattr), &rem));
        print_delayacct(stats);
    } else {
        fprintf(stderr, "unknown attribute format received\n");
        return -1;
    }
    return 0;
}
  1. nlmsg_hdr returns the actual message header from nlmsg
  2. genlmsg_parse parses a generic netlink message and stores the attributes to nlattrs
  3. we retrieve the attribute we are interested: TASKSTATS_TYPE_AGGR_PID
  4. nla_data returns a pointer to the payload of the message, we need to use nla_next because the taskstats data is actually returned on the second attribute (the first one being used just to indicate that a pid/tid will be followed by some stats)
  5. print_delayacct is used to finally print the data; this function is the same used by the linux example.

Delay examples

Let’s try to visualize some of the delay types be crafting some examples and running getdelays.

CPU scheduling delay

In this example I’m going to use the stress utility to generate some workload on a VM that has 2 cores. Using the -c <N> flag, stress creates <N> workers (forks) running sqrt() to generate some CPU load. Since this VM has two cores, I will spin two instance of stress with 2 workers each. By using the nice command, I’ll configure the niceness of the first instace to be 19, meaning that it will have a lower priority on the scheduling:

$ sudo nice -n 19 stress -c 2 & sudo stress -c 2
stress: info: [15718] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
stress: info: [15719] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd

We can check with ps that we have now 6 processes running stress, the two parents and their two forks:

root     15718  0.0  0.0   7480   864 pts/2    SN   14:24   0:00 stress -c 2
root     15719  0.0  0.0   7480   940 pts/2    S+   14:24   0:00 stress -c 2
root     15720  1.4  0.0   7480    92 pts/2    RN   14:24   0:01 stress -c 2
root     15721  1.4  0.0   7480    92 pts/2    RN   14:24   0:01 stress -c 2
root     15722 96.3  0.0   7480    92 pts/2    R+   14:24   2:00 stress -c 2
root     15723 99.0  0.0   7480    92 pts/2    R+   14:24   2:03 stress -c 2

With getdelays we can check their CPU delays (output truncated):

$ ./getdelays -d -p 15722
PID	15722
CPU             count     real total  virtual total    delay total  delay average
                 3386   130464000000   132726743949     4190941076          1.238ms

$ ./getdelays -d -p 15723
PID	15723
CPU             count     real total  virtual total    delay total  delay average
                 3298   136240000000   138605044896      550886724          0.167ms

$ ./getdelays -d -p 15720
PID	15720
CPU             count     real total  virtual total    delay total  delay average
                  533     2060000000     2084325118   142398167037        267.164ms

$ ./getdelays -d -p 15721
PID	15721

CPU             count     real total  virtual total    delay total  delay average
                  564     2160000000     2178262982   148843119281        263.906ms

Clearly, the ones from with high niceness value are experience higher delays (the average delay is around 200x higher). If we ran both instances of stress with the same niceness, we will experience the same average delay accross then.

Block I/O delay

Let’s try to experience some I/O delays running a task. We can leverage docker to limit the I/O bps for our process using the --driver-write-bps flag on docker run. First, let’s run dd without any limits:

docker run --name dd --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct

The following screenshot shows the result obtained by running getdelays on the dd process:

root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2904
print delayacct stats ON
PID	2904


CPU             count     real total  virtual total    delay total  delay average
                 6255     1068000000     1879315354       22782428          0.004ms
IO              count    delay total  delay average
                 5988    13072387639              2ms
SWAP            count    delay total  delay average
                    0              0              0ms
RECLAIM         count    delay total  delay average
                    0              0              0ms

We can see that we are getting an average of 2ms delays for I/O.

Now, let’s use --driver-write-bps to limit I/O to 1mbs:

docker run --name dd --device-write-bps /dev/sda:1mb --rm ubuntu /bin/dd if=/dev/zero of=test.out bs=1M count=8096 oflag=direct

The following screenshot shows the result of running getdelays on the process:

root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2705
print delayacct stats ON
listen forever
PID	2705


CPU             count     real total  virtual total    delay total  delay average
                   71       28000000       32436630         600096          0.008ms
IO              count    delay total  delay average
                   15    40163017300           2677ms
SWAP            count    delay total  delay average
                    0              0              0ms
RECLAIM         count    delay total  delay average
                    0              0              0ms

Since I/O is limited, dd takes much more time to write its output, we can see that our I/O delay average is 1000 times higher than before.


Side note: using --driver-write-<bps,iops> docker flags uses linux cgroups v1 and those are only able to limit the amount of I/O if we open the files with O_DIRECT, O_SYNC or O_DSYNC flags, but this deserver a blog post on its own.


Memory reclaim delay

In this example we can use, once more, the stress utility by using the --vm <N> flag to launch N workers running malloc/free to generate some memory allocation workload. Once again, this VM has 2 cores.

Using the default --vm-bytes, which is 256M, I was able to experience some delay on memory reclaim by running more than 2 workers. But the delay average was kept fairly small, below 1ms:

PID	15888
CPU             count     real total  virtual total    delay total  delay average
                 2799    38948000000    39507647880    19772492888          7.064ms
RECLAIM         count    delay total  delay average
                   11         278304              0ms

PID	15889
CPU             count     real total  virtual total    delay total  delay average
                 3009    38412000000    38904584951    20402080112          6.780ms
RECLAIM         count    delay total  delay average
                   22       16641801              0ms

PID	15890
CPU             count     real total  virtual total    delay total  delay average
                 2954    39172000000    39772710066    19571509440          6.625ms
RECLAIM         count    delay total  delay average
                   39        9505559              0ms

Since the 3 tasks are competing on a 2 core CPU, the CPU delays were much higher. Running with --vm-bytes with lower values produced even lower memory reclaim delays (in some cases, no delay is experienced).

Linux delays on higher level tools

Not many tools expose linux delays to the end user, but those are available on cpustat. I’m currently working on a PR to get them on htop.

Implementing a simple sudo

One of the exercises of “The Linux Programming Interface” chapter 38 is about implementing program, called douser, that should have the same functionality as the sudo progam. This means that if you run $ douser whoami, this should asks for the root user password and, if the password matches, should run the command whoami, which would print root. If the -u <user> flag is used, douser should ask for the user password and execute whoami on behaf of that user, printing its name.

To be able to authenticate users on the system, our program must read the /etc/shadow file, which is only readable by the root user. This means that it must run as root. But it would be pretty bad if we needed to know the root password to run execute a command as an unprivileged user, e.g, $ douser -u ubuntu ls. That is where the set-user-id permission bit comes to our rescue.

Set-User-ID programs

A Set-User-ID program sets the process effective user ID to the same as the user ID that owns the executable file. So it does not matter what user executes the program, the process will always run as the owner of the executable.

If we inspect our sudo binary we can see that the set-user-ID permission bit is set on the file:

$ ls -l /usr/bin/sudo
-rwsr-xr-x 1 root root 155008 Oct 14  2016 /usr/bin/sudo

We can see that it shows an s instead of x on the execution permission bit.

The implementation

My full implementation is available on github. In this section I will discuss the most relevant parts of it.

Authenticating

User authenticating is the most sensitive part of our program and is handled by the authenticate function. This function uses getpwnam(username) to obtain a struct contained the fields available at the /etc/passwd file on linux for the user with that particular username.

The user password is actually stored in a different file (/etc/shadow), read only by the root user. We use the function getspnam(username) to obtain a struct representing the data on that file.

Then we use the function getpass to prompt the user for the password and the function crypt to encrypt the password. If the input password and the real password matches, authenticate returns 0.

/*
    authenticate prompts the user for the password
    and validates it, returning 0 on success.
*/
int 
authenticate(char *username) 
{
    struct spwd *spwd;
    struct passwd *pwd;
    char *encrypted;
    Boolean authOk;

    pwd = getpwnam(username);
    if (pwd == NULL) 
        fatal("could not get password file");

    spwd = getspnam(username);
    if (spwd == NULL && errno == EACCES)
        fatal("no permission to read shadow password file");
    
    if (spwd != NULL)
        pwd->pw_passwd = spwd->sp_pwdp;
    
    encrypted = crypt(getpass("Password:"), pwd->pw_passwd);

    return strcmp(encrypted, pwd->pw_passwd);
}

Executing the user provided command

After authenticating the user, we use the setuid syscall to set the effective user ID of the running process to the ID of the authenticated user.

Following that, we use fork to create a child process that will have its text segment replaced by using the execvpe library function. We replace the environment variables setting $PATH to known safe locations. The parent process exits right after forking.

pwd = getpwnam(username);
if (pwd == NULL)
    fatal("unable to retrieve user info");

if (setuid(pwd->pw_uid) != 0)   
    errExit("setuid");

pid = fork();
if (pid == -1) {
    errExit("fork");
} else if (pid != 0) {
    execvpe(argv[optind], &argv[optind], envp);
    errExit("execvpe");
}
exit(EXIT_SUCCESS);

Usage

After compiling the code, one must login as root and properly set the douser binary using chown root:root ./douser and chmod u+s ./douser. The latter turns on the set-user-ID permission bit on the executable.

After the setup, run ./douser whoami and it should print root. Woot!

Disclaimer

The implementation is far from being a complete copy of sudo (that is probably obvious but wanted to make it clear). The real sudo implementation can be read here.

I decided to share my implementation and some of my reasoning as a way to both share the knowlged and also enforce it. Would love to get some feedback.

Killing a container from the inside

This past week a tsuru user was having a problem in one of his apps. Some units of his app would simply get stuck, timing out every request and apparently doing nothing. As a workaround he wanted to kill the problematic unit and force it to restart. On tsuru you are able to restart all units of an app but may not choose a particular unit to restart.

We then “sshed” into the problematic container, by using tsuru app shell -a <app-name>, and tried sending a SIGKILL to their application process (pid 1) and surprisingly it did not work.

$ sudo kill -9 1 # our first try, from inside the container

We tried SIGTERM and SIGQUIT and nothing happened. We then ssh’ed into the host, found out the pid (using docker top), issued the SIGKILL and boom the container restarted.

Reading the man page for kill(2) helped understanding this behavior:

The only signals that can be sent to process ID 1, the init process, are those for which init has explicitly installed signal handlers. This is done to assure the system is not brought down accidentally.

So, to be able to kill the container from the inside, you need to register a handler for the particular signal. It turns out that you cannot register a handler for SIGKILL (you are also not able to ignore this signal). So, one must handle a different signal, e.g, SIGTERM, and use it to shutdown the application (by raising a SIGKILL or simply exiting).

The following code shows an example that might be used to check this behavior.

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

void handler(int sig)
{
	exit(sig);
}


int main(int argc, char *argv[])
{
	int duration;
	if (argc > 1)
	{
		duration = atoi(argv[1]);
		printf("Sleeping for %ds\n", duration);
		sleep(duration);
		exit(EXIT_SUCCESS);
	}
	if(signal(SIGQUIT, handler) == SIG_ERR)
		exit(EXIT_FAILURE);

	for (;;)
		pause();
}

If the code is run as ./killable 30, the application will sleep for 30 seconds and then just exit. If that is the init process of a container, you won’t be able to send any signal to it as no handler was registered. If no argument is provided, a handler for the SIGQUIT signal is registered and we are able to send signals to it. In this latter case, we are able to kill the container successfully.

As it turns out, our advice to the user was to setup a signal handler for SIGTERM and to shutdown the application when receiving this signal.

Implementing malloc and free


Chapter 7 of “The Linux Programming Interface” is about memory allocation. One of the exercises, marked as advanced, asks the reader to implement malloc. I’ve decided to give it a shot. My full implementation is available on Github. I will try to break down some of my reasoning on the next sections and include some snippets from the code.

Memory layout of a Process

The memory allocated to each process is composed of multiple segments, as can be seen on the following image: process layout We are particularly interested on the heap (also known as the data segment), an area from which memory can be dynamically allocated at run time. The top end of the heap is called the program break.

Adjusting the program break

We can move the program break on our C program by using sbrk() and brk().

int brk(void *addr);
void *sbrk(intptr_t increment);

The first, moves the program break to the address pointed by addr, while the latter increments the program break by increment bytes. Their man pages can give more information about their implementation on Linux and other systems, but those are the basic building blocks that our malloc implementation will rely on. On Linux, sbrk relies on brk.

The implementation

The entire code for the implementation is available at github. Beware that this implementation is full of bugs (some are discussed below, some are obvious from reading the real malloc implementation). The following code is the malloc function implementation:

void           *
_malloc(size_t size)
{
    void           *block_mem;
    block_t        *ptr, *newptr;
    size_t		alloc_size = size >= ALLOC_UNIT ? size + sizeof(block_t)
        : ALLOC_UNIT;
    ptr = head;
    while (ptr) {
        if (ptr->size >= size + sizeof(block_t)) {
            block_mem = BLOCK_MEM(ptr);
            fl_remove(ptr);
            if (ptr->size == size) {
                // we found a perfect sized block, return it
                return block_mem;
            }
            // our block is bigger then requested, split it and add
            // the spare to our free list
            newptr = split(ptr, size);
            fl_add(newptr);
            return block_mem;
        } else {
            ptr = ptr->next;
        }
    }
    /* We are unable to find a free block on our free list, so we
        * should ask the OS for memory using sbrk. We will alloc
        * more alloc_size bytes (probably way more than requested) and then
        * split the newly allocated block to keep the spare space on our free
        * list */
    ptr = sbrk(alloc_size);
    if (!ptr) {
        printf("failed to alloc %ld\n", alloc_size);
        return NULL;
    }
    ptr->next = NULL;
    ptr->prev = NULL;
    ptr->size = alloc_size - sizeof(block_t);
    if (alloc_size > size + sizeof(block_t)) {
        newptr = split(ptr, size);
        fl_add(newptr);
    }
    return BLOCK_MEM(ptr);
}

Our implementation keeps a doubly linked listed of free memory blocks and every time _malloc gets called, we traverse the linked list looking for a block with at least the size requested by the user (lines 8–25). If a block with the exact requested size exists, we remove it from the list and return its address to the user (lines 11–16); if the block is larger, we split it into two blocks, return the one with the requested size to the user and adds the newly created block to the list (lines 19–21). If we are unable to find a block on the list, we must “ask” the OS for more memory, by using the sbrk function (lines 31–35). To reduce the number of calls to sbrk, we alloc a fixed number of bytes that is a multiple of the memory page size, defined as:

#define ALLOC_UNIT 3 * sysconf(_SC_PAGESIZE)

After the call to sbrk (where our program break changes value) we create a new block with the allocated size. The metadata on this block contains the size, next and previous blocks and is allocated on the first 24 bytes of the block (this is our overhead) (lines 36–38). Since we may have allocated much more memory then the user requested, we split this new block and return the one with the exact same size as requested (lines 39–43). The BLOCK_MEM macro, defined as:

#define BLOCK_MEM(ptr) ((void *)((unsigned long)ptr + sizeof(block_t)))

skips the metadata at given ptr and returns the address of the memory area that is available for the user. The _free function is quite straightforward, given a pointer that was previously “malloced” to the user, we must find its metadata (by using the BLOCK_HEADER macro) and add it to our free linked list. After that, the function scan_merge() is called to do some cleaning:

/* scan_merge scans the free list in order to find
 * continuous free blocks that can be merged and also
 * checks if our last free block ends where the program
 * break is. If it does, and the free block is larger then
 * MIN_DEALLOC then the block is released to the OS, by
 * calling brk to set the program break to the begin of
 * the block */
void
scan_merge()
{
    block_t        *curr = head;
    unsigned long	header_curr, header_next;
    unsigned long	program_break = (unsigned long)sbrk(0);
    if (program_break == 0) {
        printf("failed to retrieve program break\n");
        return;
    }
    while (curr->next) {
        header_curr = (unsigned long)curr;
        header_next = (unsigned long)curr->next;
        if (header_curr + curr->size + sizeof(block_t) == header_next) {
            /* found two continuous addressed blocks, merge them
             * and create a new block with the sum of their sizes */
            curr->size += curr->next->size + sizeof(block_t);
            curr->next = curr->next->next;
            if (curr->next) {
                curr->next->prev = curr;
            } else {
                break;
            }
        }
        curr = curr->next;
    }
    stats("after merge");
    header_curr = (unsigned long)curr;
    /* last check if our last free block ends on the program break and is
     * big enough to be released to the OS (this check is to reduce the
     * number of calls to sbrk/brk */
    if (header_curr + curr->size + sizeof(block_t) == program_break
        && curr->size >= MIN_DEALLOC) {
        fl_remove(curr);
        if (brk(curr) != 0) {
            printf("error freeing memory\n");
        }
    }
}

scan_merge() first traverses the linked list looking for continuous blocks (two different memory blocks that are free and correspond to continuous addresses). We keep the blocks sorted by address to make this step easier. For every two continuous blocks found, we merge both blocks to reduce our total overhead (less metadata to keep) (lines 18–33).

After finding the last block on our free list, we check if this blocks ends on the program break (line 39). If that is true, and the block is big enough (where “big” is defined as MIN_DEALLOC, also a multiple of the page size), we remove the block from our list and move the program break to the beginning of the block, by calling brk.

How is malloc actually implemented?

Before diving into the real malloc code, I decided to write a simple test program and trace it’s execution using strace. I used the following code:

#include <malloc.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
 malloc(atoi(argv[1]));
}

I decided to trace the executing testing different sizes.

$ strace ./malloc 1
...
brk(NULL)                               = 0x5585209f2000
brk(0x558520a13000)                     = 0x558520a13000
exit_group(0)                           = ?
+++ exited with 0 +++
$ strace ./malloc 100000
...
brk(NULL)                               = 0x55b45a386000
brk(0x55b45a3bf000)                     = 0x55b45a3bf000
exit_group(0)                           = ?
$ strace ./malloc 1000000
...
mmap(NULL, 1003520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f05f7cbf000
exit_group(0)                           = ?

The first thing I noticed between executions is that when asking for a large size, malloc actually used mmap instead of brk to allocate the memory. I wasn’t sure why, since I have yet to study mmap. But checking the code for glibc’s malloc I’ve found a pretty nice writeup made by the authors explaining their implementation. I highly recommend reading it. Regarding the use of mmap, from the source code comments:

This backup strategy generally applies only when systems have “holes” in address space, so sbrk cannot perform contiguous expansion, but there is still space available on system. On systems for which this is known to be useful (i.e. most linux kernels), this occurs only when programs allocate huge amounts of memory. Between this, and the fact that mmap regions tend to be limited, the size should be large, to avoid too many mmap calls and thus avoid running out of kernel resources.

The documentation also explains their design goals, motivations, how the blocks are organized (by using bins for the different sizes, instead of keeping a linked list sorted by address, as I did) and lots of other details. I learned a lot reading it.

Adding a call to free on my test program does not change the syscalls made by at, as the memory is not released to the OS (some people rely on this behavior “mallocing” a large chunk of memory and freeing it on the start of a program to reserve the memory). One can control this behavior by defining M_TRIM_THREASHOLD:

M_TRIM_THRESHOLD is the maximum amount of unused top-most memory to keep before releasing via malloc_trim in free(). Automatic trimming is mainly useful in long-lived programs. Because trimming via sbrk can be slow on some systems, and can sometimes be wasteful (in cases where programs immediately afterward allocate more large chunks) the value should be high enough so that your overall system performance would improve by releasing this much memory.


This blog post is part of a series of posts that I intent to write while reading “The Linux Programming Interface” to make sure I’m actually learning something, as a way to practice and to share knowledge.

Sparse files

Following my read of “The Linux Programming Interface”, I came across the concept of “file with holes” or “sparse files”. Those are files that try to use the file system more effectively by preventing it from using disk space when sections of the file are empty. The disk storage is only actually used when needed. Sparse files are really useful for backup utilities, disk images, database snapshots etc.

One way to create a sparse file is by using truncate:

$ truncate -s 10M testfile

We can check the apparent size of the file with the following command:

$ du --apparent-size testfile
10240 testfile

As stated on the man page for du, the apparent size is not the actual disk usage; it may be larger due to holes in sparse files, fragmentation, indirect blocks etc.

Now, let’s check for the actual disk usage:

$ du -h testfile
0 testfile

Note: the apparent size and the real disk usage for the file you created have the same value, your filesystem does not support sparse files. The archlinux wiki has a nice article about sparse files.

After introducing the topic, the “The Linux Programming Interface” book has an exercise on building a copy of the cp command that is able to create sparse files when the original file was sparse. The entire code for my implementation can be found here.

The real cp shell command has an heuristic to detect and keep the sparsity of the original file on to the target file (we can disable it with the flag sparse=never). Our cp copycat is going to be really simple: we will skip any ‘\0’ byte (that is how the holes are represented when reading file) by using lseek on the destination file.

Copying a file with holes

Surprisingly, while reading past the end of file results on and EOF error, writing past that does not result on an error. To create “holes” in a file, in c, all we need to do is use the lseek syscall to go beyond the end of the file.

First, let’s create a file with some content, followed by a hole and then some more content. This can be accomplished with the following c code:

#include <fcntl.h>
#include "tlpi_hdr.h"
#include "error_functions.c"
#include "get_num.c"

int main(int argc, char *argv[])
{
    int outputFd, openFlags;
    mode_t filePerms;
    off_t offset;

    if (argc < 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s file <text> hole-length <text>\n", argv[0]);
    
    openFlags = O_CREAT | O_WRONLY | O_TRUNC;
    filePerms = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
                S_IROTH | S_IWOTH; /* rw-rw-rw- */

    outputFd = open(argv[1], openFlags, filePerms);
    if (outputFd == -1)
        errExit("open");

    if (write(outputFd, &argv[2][0], strlen(&argv[2][0])) != strlen(&argv[2][0]))
        errExit("failed to write whole buffer");

    offset = getLong(&argv[3][0], GN_ANY_BASE, &argv[3][0]);

    if (lseek(outputFd, offset, SEEK_CUR) == -1)
        errExit("lseek");

    if (write(outputFd, &argv[4][0], strlen(&argv[4][0])) != strlen(&argv[4][0]))
        errExit("failed to write whole buffer");

    if (close(outputFd) == -1)
        errExit("close output");

    exit(EXIT_SUCCESS);
}

To use this script simply invoke:

$ ./hole testfile "begin" 100000 " end"

This will create a file, “testfile”, write “begin” right at the start, skip 100000 bytes and write “ end” at that offset.

My implementation of cp can be viewed here, but the interesting part is the following code:

int i, holes = 0;
while ((numRead = read(inputFd, buf, BUF_SIZE)) > 0) 
{
    if (keepHoles == 1) {
        for (i = 0; i < numRead; i++) {
            if (buf[i] == '\0') {
                holes++;
                continue;
            } else if (holes > 0) {
                lseek(outputFd, holes, SEEK_CUR);
                holes = 0;
            }
            if (write(outputFd, &buf[i], 1) != 1)
                fatal("couldnt write char to file");
        }
    } else {
        if (write(outputFd, buf, numRead) != numRead)
            fatal("couldn't write whole buffer to file");
    }
}

If we go for the naive approach (without the -k flag), just writing to the output file the exact bytes read from the original file, we will create a dense file. By passing the flag -k, we are able to keep the sparsity of the original file by using lseek(2) syscall to skip the bytes represented by \0.

Running our cp implementation on our test file results in the following results:

$ ./cp testfile testfileDense
$ ./cp -k testfile testfileSparse
$ du -h testfile testfileSparse testfileDense
8,0K testfile
8,0K testfileSparse
100K testfileDense

Great! By using lseek we successfully kept the same disk usage on our copy file!


This blog post is part of a series of posts that I intent to write while reading “The Linux Programming Interface” to make sure I’m actually learning something, as a way to practice and to share knowledge.

reboot(2) syscall and agic numbers

I’m currently reading “The Linux Programming Interface” and I intend to use this blog as a way to share what i’m learning (to make sure I’m actually learning).

In one of the first chapters there is an exercise about the reboot(2) syscall and some magic numbers that are used as parameters.

The reboot(2) syscall has the following signature:

int reboot(int magic, int magic2, int cmd, void *arg);

As can be read from the man page, the first parameter, magic, must be set to LINUX_REBOOT_MAGIC1, that is 0xfee1dead(nice one). But the mistery lies on the second parameter, magic2, which must be set to one of the following constants:

LINUX_REBOOT_MAGIC2 (that is, 672274793)
LINUX_REBOOT_MAGIC2A (that is, 85072278)
LINUX_REBOOT_MAGIC2B (that is, 369367448)
LINUX_REBOOT_MAGIC2C (that is, 537993216)

As the book and the man page states, converting these numbers to hexadecimal can give some clue on their meaning. We can use the following Go code(also available on go playground) to get their hexadecimal values.

package main

import (
	"fmt"
	"strconv"
)

func main() {
	consts := []int64{672274793, 85072278, 369367448, 537993216}
	for _, c := range consts {
		fmt.Printf("%d: %s\n", c, strconv.FormatInt(c, 16))
	}
}

Which prints:

672274793: 28121969
85072278: 5121996
369367448: 16041998
537993216: 20112000

At first glance those looked just like any random number for me, but looking a little closer I realized they are actually dates: 28/12/1969, 05/12/1996, 16/04/1998 and 20/11/2000. But what those dates mean?

The first date, 28/12/1969, was the day Linus Torvalds was born. The others all refer to his daughters birthdays. Cute.

What are those magic numbers for?

As stated on linux/reboot.c, they are merely used to make sure one does not reboot the machine by mistake:

* Reboot system call: for obvious reasons only root may call it, 
* and even root needs to set up some magic numbers in the registers * so that some mistake won't make this reboot the whole machine.