I’ve decided to learn how the Go runtime allocates memory and this is the first in a series of posts covering what I learned (and also the process itself: how I learned). These posts are based on my notes for the talk I gave at GopherCon UK in 2018:
The objective of this post is to give a broad idea of how allocation works in the runtime, without actually showing any of its code. The following posts in the series will be denser, with lots of code samples extracted from the runtime, and each will cover a specific detail of the allocator.
Disclaimer: The code samples displayed in this series were edited to simplify reading. The full code is available in the golang/go repository; this series is based on the 1.10 branch. In the last post of the series, we are going to cover the main changes in Go 1.11.
Disclaimer 2: I’m not a Go runtime expert, just someone curious and interested in these kinds of things. Feel free to reach out and correct me.
Disclaimer 3: Most of the content in these posts is implementation detail and is not defined in the Go specification. There is no guarantee that it will stay the same, so approach all of this with caution. For instance, the Go spec says nothing about a heap or a stack.
0. Memory allocation 101
Before reading this post, I recommend reading my previous post on memory allocation. That post covers the basics needed to understand what we are going to cover in this series.
1. How the allocator gets invoked
In languages like C, it’s pretty obvious when the memory allocator is invoked: that happens when you call malloc (or one of its variations) and free.
We know that Go has no malloc function, and that you are not able to explicitly free memory allocations, because the language is garbage collected. So, one may wonder how the memory allocation code gets invoked in our Go programs.
Imagine we have the following Go code:
package main
func main() {
f()
}
//go:noinline
func f() *int {
i := 10
return &i
}
The f() function is marked //go:noinline to prevent the compiler from inlining the call to f(), which would remove the function call and make the example less useful.
The compiler uses a technique called escape analysis to choose whether a variable is going to be placed on the heap or the stack. We are interested in heap allocations, and that is exactly where i is allocated. We can see that by running go build passing -gcflags "-m -m":
The -gcflags flag is a way to pass flags to the Go compiler during the build, and the "-m" flag tells the compiler to output optimization decisions (passing it twice increases verbosity). One can also use "-m=2".
$ go build -gcflags "-m -m" main.go
# command-line-arguments
./main.go:8:6: cannot inline f: marked go:noinline
./main.go:3:6: cannot inline main: non-leaf function
./main.go:10:9: &i escapes to heap
./main.go:10:9: from ~r0 (return) at ./main.go:10:2
./main.go:9:2: moved to heap: i
Imagine if i were allocated on the stack. That would make &i the address of a memory area in f()’s stack frame. When f() returns, its stack frame becomes invalid memory and thus main() can’t access any of it. Since f() uses pointer semantics to return the address of i, the compiler figures out that i should live on the heap, since it must be allocated in a memory area that remains valid even after f() returns.
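For contrast (this snippet is not part of the original example), a variant that returns i by value does not force a heap allocation, since no reference to i survives the call:

//go:noinline
func g() int {
	i := 10
	return i // returned by value: nothing outside g references i
}

Building this with the same -gcflags "-m -m" produces no "escapes to heap" or "moved to heap" lines for i, so it can stay on g()’s stack frame.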
So, we know that i is going to be allocated on the heap. But how does the runtime set that up? With the compiler’s help! We can get an idea from reading the generated assembly.
The following output was redacted for brevity, showing only the relevant parts. Some resources on Go assembly are linked at the end of the post.
mallocgc is the entrypoint to the runtime memory allocation algorithm. It receives the size, the type and whether or not the memory should be returned zeroed (true in this case).
There are other places where mallocgc gets called (e.g. runtime.makeslice). We are going to cover mallocgc in detail in the next posts; for now, all we need to know is that it does the heap allocation work and is an implementation of the algorithm we are going to cover at a high level now.
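For the curious: the assembly for our example contains a CALL to runtime.newobject, a thin wrapper that ends up calling mallocgc. The entrypoint itself looks roughly like this in the 1.10 sources (signature only, simplified; treat the exact details as subject to change):

// runtime/malloc.go (Go 1.10), signature only:
//   size     - number of bytes to allocate
//   typ      - type information, used among other things by the GC
//   needzero - whether the returned memory must be zeroed
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer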
The Go runtime allocator was initially based on tcmalloc but has diverged quite a bit.
The tcmalloc page, also linked in the code comments, has a great overview of how the algorithm works.
I’ll briefly summarize tcmalloc in this section. I recommend reading the referred page for more detail and also checking some of the benchmarks comparing tcmalloc with glibc malloc.
We can divide allocations into 2 buckets: small and large allocations. Most of the allocations are served from local thread caches, meaning that no locks are required, reducing contention and increasing the overall performance.
2.1 Small allocations
Small allocations are allocations with size <= 32kB.
The requested size is rounded up to one of the 170 size classes; e.g., allocations in the range 961 to 1024 bytes are rounded up to 1024.
tcmalloc looks at the local thread cache to see if there are free objects available for this size class (a rough sketch of this lookup follows the list below):
If there are free objects, it returns the first one from the list and we are all done. This allocation required no locks as it was totally served from the local thread cache.
If there are no free objects, fetch new objects from the central cache.
The central cache checks if it has free objects:
If there are free objects, return a bunch to the local cache;
If there are no free objects, request a chunk of memory pages from the central heap, split it into objects for this size class and return them to the local cache.
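Here is that path as a small, self-contained Go sketch. This is illustrative pseudocode written for this post, not actual tcmalloc code; the names (threadCache, refill, sizeToClass) are made up:

package main

import "fmt"

// freeList holds free objects (represented as plain integers) of one size class.
type freeList []int

// refill stands in for the central cache: it hands back a batch of objects for
// a size class, possibly after carving new objects out of central heap pages.
func refill(class int) freeList {
	return freeList{class*100 + 1, class*100 + 2, class*100 + 3}
}

// threadCache is the per-thread cache: one free list per size class.
type threadCache struct {
	free map[int]freeList
}

func (tc *threadCache) alloc(size int) int {
	class := sizeToClass(size) // round the size up to one of the ~170 classes
	if objs := tc.free[class]; len(objs) > 0 {
		// fast path: no locks, served entirely from the local thread cache
		obj := objs[len(objs)-1]
		tc.free[class] = objs[:len(objs)-1]
		return obj
	}
	// slow path: fetch a batch from the central cache (which takes a lock)
	tc.free[class] = refill(class)
	return tc.alloc(size)
}

// sizeToClass is a stand-in for the real size-class table.
func sizeToClass(size int) int { return (size + 1023) / 1024 }

func main() {
	tc := &threadCache{free: map[int]freeList{}}
	fmt.Println(tc.alloc(800), tc.alloc(800)) // second call hits the fast path
}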
2.2 Large allocations
Large allocations are served directly by the central heap. They are rounded up to the page size (4 kB), and a run of contiguous pages is called a Span.
The central heap maintains an array of free lists. For k < 256, the kth entry is a free list of Spans that consist of k pages. The 256th entry is a free list of Spans that have length >= 256 pages.
When asked for k pages, the central heap searches for a Span on the k-th free list (a sketch of this search follows the list below):
If one is found, return it to the user
If one is not found, look for it on the k+1-th free list, and so on…
If a free Span is not found in any of the free lists, allocate a large chunk of memory from the OS
If the Span found, or the memory requested from the OS, is larger than the user requested, split it into multiple Spans and place the spares on the free lists before returning to the user.
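A sketch of that search, again as illustrative Go pseudocode written for this post (the span type and the free-list layout mirror the description above; nothing here is real tcmalloc code):

// span represents a run of contiguous pages.
type span struct {
	start int // first page of the run
	pages int // number of pages in the run
}

// findSpan looks for a free span of at least n pages. freeLists[k] holds free
// spans of exactly k pages; the last entry holds spans of >= 256 pages.
func findSpan(freeLists [][]span, n int) (span, bool) {
	for k := n; k < len(freeLists); k++ {
		if l := freeLists[k]; len(l) > 0 {
			s := l[0]
			freeLists[k] = l[1:]
			// if s is larger than n pages, the remainder would be split off
			// and placed back on the corresponding free list
			return s, true
		}
	}
	// nothing free: the caller allocates a large chunk from the OS,
	// splits it and retries
	return span{}, false
}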
3. Overview of Go’s Runtime Allocator
3.1 Allocations
We can divide Go heap allocations into 3 groups (as opposed to the 2 groups in tcmalloc). Each of these is going to be covered in a separate post in this series, and a tiny code illustration of the three groups follows the list below.
Small allocations (<= 32 kB)
Small allocations in Go are pretty similar to tcmalloc’s. The request flows from the most local to the most central structure:
mcache -> mcentral -> mheap
Tiny Allocations (< 16 B and no pointers)
They follow the same path as small allocations, but are stored, managed and freed in 16 B memory blocks by the mcache.
Large Allocations (> 32 kB)
These are served directly by the mheap, first by checking its free lists and asking for more memory from the OS when needed.
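To make the three groups concrete, here is a small illustrative program (not from the original post). Assuming the results escape to the heap, which they do here since they are returned from a noinline function, each allocation below goes down a different path:

package main

//go:noinline
func allocate() (*[2]byte, []byte, []byte) {
	tiny := new([2]byte)          // 2 B, no pointers: tiny allocation
	small := make([]byte, 1024)   // 1 kB: small allocation, rounded up to a size class
	large := make([]byte, 64<<10) // 64 kB: large allocation, served by the mheap
	return tiny, small, large
}

func main() {
	allocate()
}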
3.2 Data structures
In this section we are going to get a high level view of how each of the mentioned data structures operate.
The following diagram displays the main components that are going to be described later:
We can see that each logical processor (P) has its own mcache, which is the local thread cache.
There is a single mheap, which plays the part of the central heap.
Surprisingly, there are multiple mcentral objects: one for each of the size classes. This means that even when requesting a span from an mcentral we can reduce contention, by only acquiring the lock for the requested size class. Goroutines on other processors that also need to allocate from an mcentral, but request objects of a different size class, won’t find that mcentral locked.
Thread Cache (mcache)
The thread cache is implemented by the mcache structure. There is one mcache for each P (logical processor).
The mcache maintains an mspan of each size class (for objects from 8 bytes up to 32 kB). Each of these spans is able to hold multiple objects of that size. If the span for a given class is full, the mcache requests a new one from the corresponding mcentral structure for that size class.
Since it is a per-P structure, there is no need to hold locks when allocating from the mcache, which is an interesting optimization.
The mcache also maintains a block that is used for tiny allocations (a concept that does not exist in tcmalloc) of objects that contain no pointers. We are going to cover how this works in a post about tiny allocation in Go.
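A simplified sketch of what the mcache looks like (loosely based on the Go 1.10 source, with most fields omitted; this is not the runtime’s real declaration):

type mspan struct{} // placeholder; spans are covered in the mspan section below

const numSpanClasses = 134 // 67 size classes × 2 (pointer / pointer-free) in Go 1.10

type mcache struct {
	alloc [numSpanClasses]*mspan // the span currently used to allocate each span class

	// current block used for tiny (< 16 B, pointer-free) allocations
	tiny       uintptr // base address of the current tiny block
	tinyoffset uintptr // how many bytes of the tiny block are already used
}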
Central free list (mcentral)
The central free list is implemented by mcentral. There is one mcentral for each span size class, and they are stored inside the mheap struct, which we will cover soon.
Each mcentral holds a lock that needs to be acquired when allocating from it, since mcentral is shared by all Ps.
It maintains two doubly linked lists (of type mSpanList): one for spans that contain free objects and one for spans that don’t contain any free objects or that are already in an mcache.
When an mcache requests a new span, the mcentral traverses both lists looking for a span that is available and can be handed to the mcache.
If there is no span available, the mcentral requests a new span from the central heap (mheap) and returns this new span to the mcache.
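Continuing the simplified sketch (again loosely based on the 1.10 source, not the real declaration; the runtime uses its own mutex type rather than sync.Mutex), an mcentral looks roughly like this:

type mSpanList struct{ first, last *mspan } // doubly linked list of spans

type mcentral struct {
	lock      sync.Mutex // taken by any P that allocates from or returns spans to this mcentral
	spanclass int        // the single span size class this mcentral serves
	nonempty  mSpanList  // spans with at least one free object
	empty     mSpanList  // spans with no free objects, or handed out to an mcache
}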
Central Heap (mheap)
The central heap is implemented by the mheap struct. Large objects (> 32 kB) are allocated directly from it. Since this structure is shared by all threads, allocating from it incurs locking. This is the largest structure in the allocator, so we are going to break it down.
The mheap, as in tcmalloc, maintains an array of free linked lists. The i-th item in the array contains runs of pages that consist of i pages.
The mheap also maintains an mTreap with runs of pages that are larger than 127 pages. This used to be a linked list, but was changed in this commit due to performance on large heaps.
A treap is a probabilistically balanced binary tree that offers O(log n) behavior for inserting and removing spans. The tree is kept sorted by span size, with the span’s base address used as a secondary sorting key.
When asked for n pages of memory, the mheap traverses its free linked lists looking for a span of n or more pages. In case there are no free spans, the mheap asks the OS for more memory. On Linux, it does so by using the mmap(2) syscall, asking for at least 1 MB. Contiguous free spans are subject to coalescing and are placed on the corresponding free lists.
The mheap maintains a lookup table from virtual page IDs to spans. A large enough area to map every page that can be allocated is reserved upon initialization. This lookup table can be used to find the span responsible for a given page and is the “central array” described in tcmalloc’s page.
If the span is allocated, all pages map to the span itself.
Pages that have never been allocated are nil entries in this lookup table.
The mheap also maintains:
the arena, which is the range of addresses used for the Go heap. This is used by the mheap to know when to request more memory from the OS and at what virtual address this new mapping needs to start;
A bitmap used by the garbage collector to know what is still live and can’t be collected;
A bunch of fixed-size allocators that are used to manually allocate the structures used internally by the allocator. A simplified sketch of the mheap struct follows this list.
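Putting the pieces above together, and reusing the placeholder types from the sketches above, a heavily simplified view of the mheap (most fields omitted, names loosely following the 1.10 source; not the real declaration):

type mheap struct {
	free      [128]mSpanList // free[i] holds free spans of exactly i pages
	freelarge mTreap         // treap of free spans larger than 127 pages

	spans []*mspan // lookup table: page ID -> span that owns that page

	arenaStart uintptr // start of the address range reserved for the Go heap
	arenaUsed  uintptr // how far into the arena memory has been handed out
	bitmap     uintptr // GC bitmap describing what is live on the heap

	central [numSpanClasses]mcentral // one mcentral per span size class

	// fixed-size allocators for the allocator's own internal structures
	spanalloc, cachealloc fixalloc
}

type mTreap struct{}   // placeholder for the balanced tree of large free spans
type fixalloc struct{} // placeholder for a simple fixed-size allocator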
Spans (mspan)
A span is of type mspan and represents a run of contiguous pages that are managed by the mheap. This is the object that travels between the mheap and the caches (mcentral and mcache).
Every span has a startAddr, which, as the name says, is the address of its first byte, and an npages field, which is the number of pages the span covers.
For small allocations, which means the span belongs to one of the given size classes, we can imagine the span as a container for nelems objects of that size. The mcache needs to check whether the span still has room for another object of that size.
For large allocations (> 32kB), the span holds a single object and a pointer to its base address (startAddr) is returned by mallocgc.
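And finally, a simplified sketch of the span itself (same caveats as the sketches above):

type mspan struct {
	next, prev *mspan // links used by the mSpanLists in mcentral and mheap

	startAddr uintptr // address of the first byte of the span
	npages    uintptr // number of 8 kB pages in the span

	spanclass int     // size class served by this span (0 for large, single-object spans)
	elemsize  uintptr // object size for that size class
	nelems    uintptr // how many objects of that size fit in the span
	freeindex uintptr // where to start looking for the next free object
}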
Wrapping Up
I think this is enough for a first, introductory post. In the next post we are going to start exploring the runtime code and see how it handles large object allocations.
Developers running their apps on tsuru can choose plans based on memory and CPU usage. We were looking at adding I/O usage as one of the plan’s features, so one could choose how many I/O operations per second, or bytes per second, a container is able to read/write.
Being able to limit I/O is particularly important when running a diverse cloud, where applications with different workloads and needs run together, sharing multiple resources. We need to make sure that an application that starts to behave badly does not interfere with the others.
Our main scheduler is Docker-based, and Docker exposes some flags to limit a container’s IOPS (I/O operations per second) and BPS (bytes per second). Docker relies on a Linux kernel feature, called cgroups, to limit a process’s resource usage.
Before exposing this as a possible parameter on tsuru plans, I decided to investigate and do some experimentation using cgroups to limit a process’s I/O (IOPS or BPS). I came to the conclusion that, currently, it is not possible to fulfill our needs, and decided to delay the implementation.
In the next section, we are going to discuss cgroups, the main kernel feature used to limit resource usage.
Introduction to cgroups
Cgroups are a mechanism available in Linux to aggregate/partition a set of tasks and all their future children. Different subsystems may hook into cgroups to provide different behaviors, such as resource accounting/limiting (this particular kind of subsystem is called controller).
Cgroups (along with namespaces) are one of the building blocks of containers, but you don’t need a container runtime to make use of them.
Managing cgroups is done by interacting with the cgroup filesystem: creating directories and writing to certain files. There are two versions of cgroups available in newer kernels: v1 and v2. Cgroups v2 completely changes the interface between userspace and the kernel and, as of today, container runtimes only support cgroups v1, so we will focus on v1 first.
Cgroups v1
Cgroups v1 has a per-resource (memory, blkio etc.) hierarchy, where each resource hierarchy contains cgroups for that resource. They all live under /sys/fs/cgroup.
Each PID is in exactly one cgroup per resource. If a PID is not assigned to a specific cgroup for a resource, it is in the root cgroup for that particular resource. Important: Even if a cgroup has the same name in resourceA and resourceB they are considered distinct. Some of the available resources are:
blkio: Block IO controller - limit I/O usage
cpuacct: CPU accounting controller - accounting for CPU usage
cpuset: CPU set controller - allows assigning a set of CPUs and memory nodes to a set of tasks
hugeTLB: Huge TLB controller - allows limiting the usage of huge pages
devices: Device whitelist controller - enforce open and mknod restrictions on device files
pids: Process number controller - limit the number of tasks
In the next section, we are going to investigate the use of the blkio controller to limit the I/O bytes per second of a task running under a cgroup.
Limiting I/O with cgroups v1
To limit I/O we need to create a cgroup in the blkio controller.
$ mkdir -p /sys/fs/cgroup/blkio/g1
We are going to set our limit using the blkio.throttle.write_bps_device file. This requires us to specify limits per device, so we must find out our device’s major and minor numbers:
$ cat /proc/partitions
major minor #blocks name
8 0 10485760 sda
8 1 10484719 sda1
8 16 10240 sdb
Let’s limit the write bytes per second to 1048576 (1MB/s) on the sda device (8:0):
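The shell one-liner that writes this limit isn’t reproduced here, but the idea is simply to write into files under the cgroup directory. As a rough sketch (not from the original post), the same setup can be done from Go; the paths match the cgroup created above and the program takes the PID to throttle as an argument:

package main

import (
	"io/ioutil"
	"os"
)

func main() {
	cg := "/sys/fs/cgroup/blkio/g1" // the cgroup created with mkdir above

	// "<major>:<minor> <bytes per second>": 1 MB/s of writes on sda (8:0)
	limit := []byte("8:0 1048576")
	if err := ioutil.WriteFile(cg+"/blkio.throttle.write_bps_device", limit, 0644); err != nil {
		panic(err)
	}

	// move the target process (e.g. the shell that will run dd) into the cgroup
	pid := []byte(os.Args[1])
	if err := ioutil.WriteFile(cg+"/cgroup.procs", pid, 0644); err != nil {
		panic(err)
	}
}

From a shell, the equivalent is echoing those same strings into blkio.throttle.write_bps_device and cgroup.procs (e.g. echo $$ > cgroup.procs to add the current shell).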
Running dd if=/dev/zero of=/tmp/file1 bs=512M count=1 from inside the cgroup, we were still able to write at 429 MB/s; we can see from the iostat output that all 512 MB were written within the same second, ignoring the limit. Let’s try the same command but opening the file with the O_DIRECT flag (passing oflag=direct to dd):
$ dd if=/dev/zero of=/tmp/file1 bs=512M count=1 oflag=direct
1+0 records in
1+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 539.509 s, 995 kB/s
At the same time, we get this output from iostat (redacted):
Et voilà! Our writes were kept below 1.0 MB/s. Why didn’t it work on the first try? Let’s try to understand in the next section.
The I/O Path
So, what’s the difference between opening the file with O_DIRECT and opening the file with no flags? The article “Ensuring Data Reaches the Disk” does a pretty good job explaining all the I/O flavors in Linux and their different paths in the kernel.
That data starts out as one or more blocks of memory, or buffers, in the application itself. Those buffers can also be handed to a library, which may perform its own buffering. Regardless of whether data is buffered in application buffers or by a library, the data lives in the application’s address space. The next layer that the data goes through is the kernel, which keeps its own version of a write-back cache called the page cache. Dirty pages can live in the page cache for an indeterminate amount of time, depending on overall system load and I/O patterns. When dirty data is finally evicted from the kernel’s page cache, it is written to a storage device (such as a hard disk). The storage device may further buffer the data in a volatile write-back cache. If power is lost while data is in this cache, the data will be lost. Finally, at the very bottom of the stack is the non-volatile storage. When the data hits this layer, it is considered to be “safe.”
Basically, when we write to a file (opened without any special flags), the data travels across a bunch of buffers and caches before it is effectively written to the disk.
Opening a file with O_DIRECT (available since Linux 2.4.10), means (from man pages):
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers.
So, for some reason, when we bypassed the kernel’s page cache, the cgroup was able to enforce the I/O limit specified.
This commit in Linux adds some documentation that explains exactly what is happening.
On traditional cgroup hierarchies, relationships between different controllers cannot be established making it impossible for writeback to operate accounting for cgroup resource restrictions and all writeback IOs are attributed to the root cgroup.
It’s important to notice that this was added when cgroups v2 were already a reality (but still experimental), so “traditional cgroup hierarchies” means cgroups v1. Since in cgroups v1 different resources/controllers (memory, blkio) live in different hierarchies on the filesystem, even when those cgroups have the same name they are completely independent. So, when a memory page is finally flushed to disk, there is no way to know which blkio cgroup is responsible for it, and the writeback I/O ends up attributed to the root cgroup of the blkio controller.
Right below that statement, we find:
If both the blkio and memory controllers are used on the v2 hierarchy and the filesystem supports cgroup writeback, writeback operations correctly follow the resource restrictions imposed by both memory and blkio controllers.
So, in order to limit I/O when that I/O may hit the writeback kernel cache, we need to use both the memory and io controllers in cgroups v2!
In cgroups v2 there is only a single hierarchy, instead of one hierarchy per resource. Supposing the cgroups v2 filesystem is mounted at /sys/fs/cgroup/:
This hierarchy means that we may impose limits on both I/O and memory by writing to files in the cgroup1 cgroup, and those limits may be inherited by cgroup3. Not every controller supported in cgroups v1 is available in cgroups v2: currently, one may use the memory, io, rdma and pids controllers.
Let’s try the same experiment as before using cgroups v2 this time. The following example uses Ubuntu 17.04 (4.10.0-35-generic).
Disabling cgroups v1
First of all, we need to disable cgroups v1. To do that, I’ve added GRUB_CMDLINE_LINUX_DEFAULT="cgroup_no_v1=all" to /etc/default/grub and rebooted. That kernel command-line flag disables cgroups v1 for all controllers (blkio, memory, cpu and so on). This guarantees that both the io and memory controllers will be used on the v2 hierarchy (one of the requirements mentioned by the writeback documentation quoted earlier).
Limiting I/O
First, let’s mount the cgroups v2 filesystem in /cgroup2:
$ mount -t cgroup2 nodev /cgroup2
Now, create a new cgroup, called cg2 by creating a directory under the mounted fs:
$ mkdir /cgroup2/cg2
To be able to edit the I/O limits using the I/O controller on the newly created cgroup, we need to write “+io” to the cgroup.subtree_control file in the parent (in this case, root) cgroup:
$ echo "+io" > /cgroup2/cgroup.subtree_control
Checking the cgroup.controllers file for the cg2 cgroup, we see that the io controller is enabled:
$ cat /cgroup2/cg2/cgroup.controllers
io
To limit I/O to 1MB/s, as done previously, we write into the io.max file:
$ echo "8:0 wbps=1048576" > /cgroup2/cg2/io.max
Let’s add our bash session to the cg2 cgroup, by writing its PID to cgroup.procs:
$ echo $$ > /cgroup2/cg2/cgroup.procs
Now, let’s use dd to generate some I/O workload and watch it with iostat:
$ dd if=/dev/zero of=/tmp/file1 bs=512M count=1
1+0 records in
1+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 468.137 s, 1.1 MB/s
At the same time, we get this output from iostat (redacted):
So, even when relying on the kernel’s writeback cache, we are able to limit the I/O on the disk.
Wrapping Up
We’ve seen that limiting I/O using cgroups is hard. After some experimentation and research we found out that using cgroups v2 is the only way to properly limit I/O (if you can’t change your application to do direct or sync I/O).
As a result of this experiment, we decided not to limit I/O using Docker for the moment, since it does not support cgroups v2 (yet).
Ever wondered how long your program spends waiting for I/O to finish? Or whether it spends a lot of time waiting for its turn to run on one of the CPUs? Linux provides delay accounting information that may help answer these and other questions. Delay information is available for many types of resources:
waiting for a CPU (while being runnable)
completion of synchronous block I/O initiated by the task
swapping in pages
memory reclaim
This information is available in nanoseconds, on a per pid/tid basis, and is pretty useful to find out if your system resources are saturated by the number of concurrent tasks running on the machine. You can either reduce the amount of work being done on the machine, by removing unnecessary processes, or adjust the priority (CPU priority, I/O priority and RSS limit) of important tasks.
Accessing delay accounting information
This information is available to userspace programs through the Netlink interface, an interface that user-space programs on Linux use to communicate with the kernel. It can be used for a bunch of things: managing network interfaces, setting IP addresses and routes, and so on.
Linux ships with a source code example, getdelays, of how to build tools that consume this information [2]. By using ./getdelays -d -p <PID> we can visualize the delays experienced by a process while consuming different kinds of resources.
Side note: since this commit, Linux requires a process to run as root to be able to fetch delay accounting information. I plan to check whether this could be changed so a user may check delay information for any process they own.
getdelays states that “It is recommended that commercial grade applications use libnl or libnetlink and use the interfaces provided by the library”, so I decided to rewrite part of getdelays using a higher level library, instead of having to handle parsing and other intricacies of the netlink protocol.
Re-implementing getdelays using libnl
I found libnl to be a quite flexible library and was able to write this example in a couple of hours (and I didn’t have any prior experience with netlink). Their documentation on the Netlink protocol had everything I needed to understand the protocol.
The source code for my implementation is available on my GitHub and uses libnl to “talk” netlink. In the following sections I’ll highlight the most important parts of the implementation.
we start by calling nl_socket_alloc() to allocate a netlink socket, required for the communication with the netlink interface
the call to nl_connect connects our socket to the NETLINK_GENERIC protocol (depending on our needs, we can use other protocols like NETLINK_ROUTE for routing operations)
genl_ctrl_resolve is used to obtain the family id of taskstats. This is the “postal code” of the delay information holder
After the setup we are ready to prepare our netlink message.
Libnl offers a bunch of callback hooks that can be used to handle different kinds of events. Using nl_socket_modify_cb we register a custom callback (NL_CB_CUSTOM) callback_message that will be called for all valid messages received from the kernel (NL_CB_VALID)
nlmsg_alloc allocates a struct to hold the message that will be sent
genlmsg_put sets the message header: NL_AUTO_PID and NL_AUTO_SEQ tell libnl to fill in the message pid and sequence numbers, required by the protocol; family is the taskstats family id; NLM_F_REQUEST indicates that this message is a request; TASKSTATS_CMD_GET is the command that we are sending to the taskstats interface, meaning that we want to get some information; and TASKSTATS_VERSION is used by the kernel to be able to handle different versions of this interface
nla_put_u32 sets the attribute TASKSTATS_CMD_ATTR_PID, which indicates that we are asking for the taskstats information of a particular pid, provided as the attribute value
nlmsg_hdr returns the actual message header from nlmsg
genlmsg_parse parses a generic netlink message and stores the attributes to nlattrs
we retrieve the attribute we are interested in: TASKSTATS_TYPE_AGGR_PID
nla_data returns a pointer to the payload of the message, we need to use nla_next because the taskstats data is actually returned on the second attribute (the first one being used just to indicate that a pid/tid will be followed by some stats)
print_delayacct is used to finally print the data; this function is the same one used by the Linux example.
Delay examples
Let’s try to visualize some of the delay types by crafting some examples and running getdelays.
CPU scheduling delay
In this example I’m going to use the stress utility to generate some workload on a VM that has 2 cores. Using the -c <N> flag, stress creates <N> workers (forks) running sqrt() to generate CPU load. Since this VM has two cores, I will spin up two instances of stress with 2 workers each. Using the nice command, I’ll set the niceness of the first instance to 19, meaning it will have a lower scheduling priority:
With getdelays we can check their CPU delays (output truncated):
$ ./getdelays -d -p 15722
PID 15722
CPU count real total virtual total delay total delay average
3386 130464000000 132726743949 4190941076 1.238ms
$ ./getdelays -d -p 15723
PID 15723
CPU count real total virtual total delay total delay average
3298 136240000000 138605044896 550886724 0.167ms
$ ./getdelays -d -p 15720
PID 15720
CPU count real total virtual total delay total delay average
533 2060000000 2084325118 142398167037 267.164ms
$ ./getdelays -d -p 15721
PID 15721
CPU count real total virtual total delay total delay average
564 2160000000 2178262982 148843119281 263.906ms
Clearly, the workers with the higher niceness value experience higher delays (the average delay is around 200x higher). If we run both instances of stress with the same niceness, we experience roughly the same average delay across them.
Block I/O delay
Let’s try to experience some I/O delays when running a task. We can leverage Docker to limit the I/O bps for our process using the --device-write-bps flag on docker run. First, let’s run dd without any limits:
The following output was obtained by running getdelays on the dd process:
root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2904
print delayacct stats ON
PID 2904
CPU count real total virtual total delay total delay average
6255 1068000000 1879315354 22782428 0.004ms
IO count delay total delay average
5988 13072387639 2ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
We can see that we are getting an average of 2ms delays for I/O.
Now, let’s use --device-write-bps to limit I/O to 1 MB/s:
The following output shows the result of running getdelays on the process:
root@ubuntu-xenial:/home/ubuntu/github/linux/tools/accounting# ./getdelays -d -p 2705
print delayacct stats ON
listen forever
PID 2705
CPU count real total virtual total delay total delay average
71 28000000 32436630 600096 0.008ms
IO count delay total delay average
15 40163017300 2677ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
Since I/O is limited, dd takes much longer to write its output, and we can see that our average I/O delay is over 1000 times higher than before.
Side note: the --device-write-<bps,iops> Docker flags use Linux cgroups v1, and those are only able to limit I/O if we open the files with the O_DIRECT, O_SYNC or O_DSYNC flags, but this deserves a blog post of its own.
Memory reclaim delay
In this example we can use, once more, the stress utility by using the --vm <N> flag to launch N workers running malloc/free to generate some memory allocation workload. Once again, this VM has 2 cores.
Using the default --vm-bytes, which is 256M, I was able to experience some delay on memory reclaim by running more than 2 workers. But the delay average was kept fairly small, below 1ms:
PID 15888
CPU count real total virtual total delay total delay average
2799 38948000000 39507647880 19772492888 7.064ms
RECLAIM count delay total delay average
11 278304 0ms
PID 15889
CPU count real total virtual total delay total delay average
3009 38412000000 38904584951 20402080112 6.780ms
RECLAIM count delay total delay average
22 16641801 0ms
PID 15890
CPU count real total virtual total delay total delay average
2954 39172000000 39772710066 19571509440 6.625ms
RECLAIM count delay total delay average
39 9505559 0ms
Since the 3 tasks are competing for 2 cores, the CPU delays were much higher. Running --vm-bytes with lower values produced even lower memory reclaim delays (in some cases, no delay at all).
Linux delays on higher level tools
Not many tools expose Linux delays to the end user, but they are available in cpustat. I’m currently working on a PR to get them into htop.
One of the exercises in chapter 38 of “The Linux Programming Interface” is about implementing a program, called douser, that has the same functionality as the sudo program. This means that running $ douser whoami should ask for the root user’s password and, if the password matches, run the command whoami, which would print root. If the -u <user> flag is used, douser should ask for that user’s password and execute whoami on behalf of that user, printing that user’s name.
To be able to authenticate users on the system, our program must read the /etc/shadow file,
which is only readable by the root user. This means that it must run as root.
But it would be pretty bad if we needed to know the root password to execute a command as an unprivileged user, e.g., $ douser -u ubuntu ls. That is where the set-user-ID permission bit comes to our rescue.
Set-User-ID programs
A set-user-ID program sets the process’s effective user ID to the user ID of the owner of the executable file. So it does not matter which user executes the program: the process will always run as the owner of the executable.
If we inspect our sudo binary we can see that the set-user-ID permission bit is set on the file:
$ ls -l /usr/bin/sudo
-rwsr-xr-x 1 root root 155008 Oct 14 2016 /usr/bin/sudo
We can see that it shows an s instead of an x in the owner’s execute permission bit.
The implementation
My full implementation is available on github.
In this section I will discuss the most relevant parts of it.
Authenticating
User authentication is the most sensitive part of our program and is handled by the authenticate function. This function uses getpwnam(username) to obtain a struct containing the fields of the /etc/passwd entry for that particular username. The user’s password is actually stored in a different file (/etc/shadow), readable only by the root user. We use the function getspnam(username) to obtain a struct representing the data in that file. Then we use getpass to prompt the user for the password and crypt to encrypt it. If the entered password and the real password match, authenticate returns 0.
/*
authenticate prompts the user for the password
and validates it, returning 0 on success.
*/
int
authenticate(char *username)
{
struct spwd *spwd;
struct passwd *pwd;
char *encrypted;
Boolean authOk;
pwd = getpwnam(username);
if (pwd == NULL)
fatal("could not get password file");
spwd = getspnam(username);
if (spwd == NULL && errno == EACCES)
fatal("no permission to read shadow password file");
if (spwd != NULL)
pwd->pw_passwd = spwd->sp_pwdp;
encrypted = crypt(getpass("Password:"), pwd->pw_passwd);
return strcmp(encrypted, pwd->pw_passwd);
}
Executing the user provided command
After authenticating the user, we use the setuid syscall to set the effective user ID of the running process to the ID of the authenticated user. Following that, we use fork to create a child process whose text segment gets replaced by using the execvpe library function. We replace the environment variables, setting $PATH to known safe locations. The parent process exits right after forking.
pwd = getpwnam(username);
if (pwd == NULL)
fatal("unable to retrieve user info");
if (setuid(pwd->pw_uid) != 0)
errExit("setuid");
pid = fork();
if (pid == -1) {
errExit("fork");
} else if (pid == 0) {
/* child: replace our image with the requested command */
execvpe(argv[optind], &argv[optind], envp);
errExit("execvpe");
}
/* parent exits right after forking */
exit(EXIT_SUCCESS);
Usage
After compiling the code, one must log in as root and set up the douser binary using chown root:root ./douser and chmod u+s ./douser. The latter turns on the set-user-ID permission bit on the executable.
After the setup, run ./douser whoami and it should print root. Woot!
Disclaimer
The implementation is far from being a complete copy of sudo (that is probably obvious, but I wanted to make it clear). The real sudo implementation can be read here. I decided to share my implementation and some of my reasoning as a way to both share the knowledge and reinforce it. I would love to get some feedback.
This past week a tsuru user was having a problem in one of his apps. Some units of his app would simply get stuck, timing out every request and apparently doing nothing. As a workaround he wanted to kill the problematic unit and force it to restart. On tsuru you are able to restart all units of an app but may not choose a particular unit to restart.
We then “sshed” into the problematic container, by using tsuru app shell -a <app-name>, and tried sending a SIGKILL to their application process (pid 1) and surprisingly it did not work.
$ sudo kill -9 1 # our first try, from inside the container
We tried SIGTERM and SIGQUIT and nothing happened. We then ssh’ed into the host, found out the pid (using docker top), issued the SIGKILL and, boom, the container restarted. The kill(2) man page explains why:
The only signals that can be sent to process ID 1, the init process, are those for which init has explicitly installed signal handlers. This is done to assure the system is not brought down accidentally.
So, to be able to kill the container from the inside, you need to register a handler for the particular signal. It turns out that you cannot register a handler for SIGKILL (you are also not able to ignore this signal). So, one must handle a different signal, e.g. SIGTERM, and use it to shut down the application (by raising a SIGKILL or simply exiting).
The following code shows an example that might be used to check this behavior.
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
void handler(int sig)
{
exit(sig);
}
int main(int argc, char *argv[])
{
int duration;
if (argc > 1)
{
duration = atoi(argv[1]);
printf("Sleeping for %ds\n", duration);
sleep(duration);
exit(EXIT_SUCCESS);
}
if(signal(SIGQUIT, handler) == SIG_ERR)
exit(EXIT_FAILURE);
for (;;)
pause();
}
If the code is run as ./killable 30, the application will sleep for 30 seconds and then just exit. If that is the init process of a container, you won’t be able to send any signal to it as no handler was registered. If no argument is provided, a handler for the SIGQUIT signal is registered and we are able to send signals to it. In this latter case, we are able to kill the container successfully.
As it turns out, our advice to the user was to set up a signal handler for SIGTERM and to shut down the application when receiving this signal.
Chapter 7 of “The Linux Programming Interface” is about memory allocation. One of the exercises, marked as advanced, asks the reader to implement malloc. I’ve decided to give it a shot. My full implementation is available on GitHub. I will try to break down some of my reasoning in the next sections and include some snippets from the code.
Memory layout of a Process
The memory allocated to each process is composed of multiple segments, as can be seen on the following image:
We are particularly interested in the heap (also known as the data segment), an area from which memory can be dynamically allocated at run time. The top end of the heap is called the program break.
Adjusting the program break
We can move the program break in our C program by using sbrk() and brk().
int brk(void *addr);
void *sbrk(intptr_t increment);
The first moves the program break to the address pointed to by addr, while the latter increments the program break by increment bytes. Their man pages give more information about their implementation on Linux and other systems, but those are the basic building blocks that our malloc implementation will rely on. On Linux, sbrk is implemented on top of brk.
The implementation
The entire code for the implementation is available at github. Beware that this implementation is full of bugs (some are discussed below, some are obvious from reading the real malloc implementation). The following code is the malloc function implementation:
void *
_malloc(size_t size)
{
void *block_mem;
block_t *ptr, *newptr;
size_t alloc_size = size >= ALLOC_UNIT ? size + sizeof(block_t)
: ALLOC_UNIT;
ptr = head;
while (ptr) {
if (ptr->size >= size + sizeof(block_t)) {
block_mem = BLOCK_MEM(ptr);
fl_remove(ptr);
if (ptr->size == size) {
// we found a perfect sized block, return it
return block_mem;
}
// our block is bigger then requested, split it and add
// the spare to our free list
newptr = split(ptr, size);
fl_add(newptr);
return block_mem;
} else {
ptr = ptr->next;
}
}
/* We are unable to find a free block on our free list, so we
* should ask the OS for memory using sbrk. We will alloc
* more alloc_size bytes (probably way more than requested) and then
* split the newly allocated block to keep the spare space on our free
* list */
ptr = sbrk(alloc_size);
if (ptr == (void *) -1) { /* sbrk returns (void *) -1 on failure */
printf("failed to alloc %ld\n", alloc_size);
return NULL;
}
ptr->next = NULL;
ptr->prev = NULL;
ptr->size = alloc_size - sizeof(block_t);
if (alloc_size > size + sizeof(block_t)) {
newptr = split(ptr, size);
fl_add(newptr);
}
return BLOCK_MEM(ptr);
}
Our implementation keeps a doubly linked list of free memory blocks, and every time _malloc gets called, we traverse the linked list looking for a block with at least the size requested by the user (lines 8–25). If a block with the exact requested size exists, we remove it from the list and return its address to the user (lines 11–16); if the block is larger, we split it into two blocks, return the one with the requested size to the user and add the newly created block to the list (lines 19–21).
If we are unable to find a block on the list, we must “ask” the OS for more memory, by using the sbrk function (lines 31–35). To reduce the number of calls to sbrk, we allocate a fixed number of bytes that is a multiple of the memory page size, defined as:
#define ALLOC_UNIT 3 * sysconf(_SC_PAGESIZE)
After the call to sbrk (where our program break changes value) we create a new block with the allocated size. The metadata on this block contains the size and the next and previous blocks, and is stored in the first 24 bytes of the block (this is our overhead) (lines 36–38). Since we may have allocated much more memory than the user requested, we split this new block and return the one with the exact size requested (lines 39–43). The BLOCK_MEM macro skips the metadata at the given ptr and returns the address of the memory area that is available to the user.
The _free function is quite straightforward: given a pointer that was previously “malloced” to the user, we must find its metadata (by using the BLOCK_HEADER macro) and add it to our free linked list. After that, the function scan_merge() is called to do some cleaning:
/* scan_merge scans the free list in order to find
* continuous free blocks that can be merged and also
* checks if our last free block ends where the program
* break is. If it does, and the free block is larger then
* MIN_DEALLOC then the block is released to the OS, by
* calling brk to set the program break to the begin of
* the block */
void
scan_merge()
{
block_t *curr = head;
unsigned long header_curr, header_next;
unsigned long program_break = (unsigned long)sbrk(0);
if (program_break == 0) {
printf("failed to retrieve program break\n");
return;
}
while (curr->next) {
header_curr = (unsigned long)curr;
header_next = (unsigned long)curr->next;
if (header_curr + curr->size + sizeof(block_t) == header_next) {
/* found two continuous addressed blocks, merge them
* and create a new block with the sum of their sizes */
curr->size += curr->next->size + sizeof(block_t);
curr->next = curr->next->next;
if (curr->next) {
curr->next->prev = curr;
} else {
break;
}
}
curr = curr->next;
}
stats("after merge");
header_curr = (unsigned long)curr;
/* last check if our last free block ends on the program break and is
* big enough to be released to the OS (this check is to reduce the
* number of calls to sbrk/brk */
if (header_curr + curr->size + sizeof(block_t) == program_break
&& curr->size >= MIN_DEALLOC) {
fl_remove(curr);
if (brk(curr) != 0) {
printf("error freeing memory\n");
}
}
}
scan_merge() first traverses the linked list looking for contiguous blocks (two different free memory blocks that correspond to contiguous addresses). We keep the blocks sorted by address to make this step easier. For every two contiguous blocks found, we merge them to reduce our total overhead (less metadata to keep) (lines 18–33).
After finding the last block on our free list, we check if this block ends at the program break (line 39). If that is true, and the block is big enough (where “big enough” is defined by MIN_DEALLOC, also a multiple of the page size), we remove the block from our list and move the program break to the beginning of the block by calling brk.
How is malloc actually implemented?
Before diving into the real malloc code, I decided to write a simple test program (a few malloc calls of different sizes) and trace its execution using strace.
The first thing I noticed between executions is that, when asking for a large size, malloc actually used mmap instead of brk to allocate the memory. I wasn’t sure why, since I had yet to study mmap. But while checking the code for glibc’s malloc I found a pretty nice writeup made by the authors explaining their implementation. I highly recommend reading it. Regarding the use of mmap, from the source code comments:
This backup strategy generally applies only when systems have “holes” in address space, so sbrk cannot perform contiguous expansion, but there is still space available on system. On systems for which this is known to be useful (i.e. most linux kernels), this occurs only when programs allocate huge amounts of memory. Between this, and the fact that mmap regions tend to be limited, the size should be large, to avoid too many mmap calls and thus avoid running out of kernel resources.
The documentation also explains their design goals, motivations, how the blocks are organized (by using bins for the different sizes, instead of keeping a linked list sorted by address, as I did) and lots of other details. I learned a lot reading it.
Adding a call to free to my test program does not change the syscalls made by it, as the memory is not released to the OS (some people rely on this behavior, “mallocing” a large chunk of memory and freeing it at the start of a program to reserve the memory). One can control this behavior by defining M_TRIM_THRESHOLD:
M_TRIM_THRESHOLD is the maximum amount of unused top-most memory to keep before releasing via malloc_trim in free(). Automatic trimming is mainly useful in long-lived programs. Because trimming via sbrk can be slow on some systems, and can sometimes be wasteful (in cases where programs immediately afterward allocate more large chunks) the value should be high enough so that your overall system performance would improve by releasing this much memory.
This blog post is part of a series of posts that I intend to write while reading “The Linux Programming Interface” to make sure I’m actually learning something, as a way to practice and to share knowledge.
Following my reading of “The Linux Programming Interface”, I came across the concept of “files with holes”, or “sparse files”. Those are files that use the file system more efficiently by preventing it from using disk space for sections of the file that are empty. The disk storage is only actually used when needed. Sparse files are really useful for backup utilities, disk images, database snapshots etc.
One way to create a sparse file is by using truncate:
$ truncate -s 10M testfile
We can check the apparent size of the file with the following command:
$ du --apparent-size testfile
10240 testfile
As stated on the man page for du, the apparent size is not the actual disk usage; it may be larger due to holes in sparse files, fragmentation, indirect blocks etc.
After introducing the topic, “The Linux Programming Interface” has an exercise on building a copy of the cp command that is able to create sparse files when the original file was sparse. The entire code for my implementation can be found here.
The real cp shell command has a heuristic to detect and keep the sparsity of the original file in the target file (it can be disabled with the --sparse=never flag). Our cp copycat is going to be really simple: we will skip any '\0' byte (that is how the holes are represented when reading the file) by using lseek on the destination file.
Copying a file with holes
Surprisingly, while reading past the end of a file results in EOF, writing past it does not result in an error. To create “holes” in a file, in C, all we need to do is use the lseek syscall to go beyond the end of the file.
First, let’s create a file with some content, followed by a hole and then some more content. A small C program does it: it creates a file, “testfile”, writes “begin” right at the start, seeks 100000 bytes ahead and writes “ end” at that offset.
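The original C snippet is not reproduced here; as a rough sketch of the same idea (file name and offsets taken from the description above), a Go equivalent would be:

package main

import (
	"io"
	"os"
)

func main() {
	f, err := os.Create("testfile")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	f.WriteString("begin")       // some content at the very start of the file
	f.Seek(100000, io.SeekStart) // seek past EOF: the skipped range becomes a hole
	f.WriteString(" end")        // writing here extends the apparent size to ~100 kB
}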
My implementation of cp can be viewed here, but the interesting part is the following code:
int i, holes = 0;
while ((numRead = read(inputFd, buf, BUF_SIZE)) > 0)
{
if (keepHoles == 1) {
for (i = 0; i < numRead; i++) {
if (buf[i] == '\0') {
holes++;
continue;
} else if (holes > 0) {
lseek(outputFd, holes, SEEK_CUR);
holes = 0;
}
if (write(outputFd, &buf[i], 1) != 1)
fatal("couldnt write char to file");
}
} else {
if (write(outputFd, buf, numRead) != numRead)
fatal("couldn't write whole buffer to file");
}
}
If we go for the naive approach (without the -k flag), just writing to the output file the exact bytes read from the original file, we will create a dense file. By passing the -k flag, we are able to keep the sparsity of the original file by using the lseek(2) syscall to skip the bytes represented by \0.
Running our cp implementation (with -k) on our test file gives the expected result: by using lseek we successfully kept the same disk usage on our copy of the file!
This blog post is part of a series of posts that I intend to write while reading “The Linux Programming Interface” to make sure I’m actually learning something, as a way to practice and to share knowledge.
I’m currently reading “The Linux Programming Interface” and I intend to use this blog as a way to share what I’m learning (to make sure I’m actually learning).
In one of the first chapters there is an exercise about the reboot(2) syscall and some magic numbers that are used as parameters.
The reboot(2) syscall has the following signature:
int reboot(int magic, int magic2, int cmd, void *arg);
As can be read in the man page, the first parameter, magic, must be set to LINUX_REBOOT_MAGIC1, which is 0xfee1dead (nice one). But the mystery lies in the second parameter, magic2, which must be set to one of the following constants:
LINUX_REBOOT_MAGIC2 (that is, 672274793)
LINUX_REBOOT_MAGIC2A (that is, 85072278)
LINUX_REBOOT_MAGIC2B (that is, 369367448)
LINUX_REBOOT_MAGIC2C (that is, 537993216)
As the book and the man page state, converting these numbers to hexadecimal gives some clue about their meaning. We can use the following Go code (also available on the Go playground) to get their hexadecimal values.
package main
import (
"fmt"
"strconv"
)
func main() {
consts := []int64{672274793, 85072278, 369367448, 537993216}
for _, c := range consts {
fmt.Printf("%d: %s\n", c, strconv.FormatInt(c, 16))
}
}
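Running it prints the hexadecimal form of each constant:

672274793: 28121969
85072278: 5121996
369367448: 16041998
537993216: 20112000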
At first glance those looked like random numbers to me, but looking a little closer I realized they are actually dates: 0x28121969, 0x05121996, 0x16041998 and 0x20112000, i.e. 28/12/1969, 05/12/1996, 16/04/1998 and 20/11/2000. But what do those dates mean?
The first date, 28/12/1969, is the day Linus Torvalds was born. The others all refer to his daughters’ birthdays. Cute.
What are those magic numbers for?
As stated in linux/reboot.c, they are merely used to make sure one does not reboot the machine by mistake:
* Reboot system call: for obvious reasons only root may call it,
* and even root needs to set up some magic numbers in the registers
* so that some mistake won't make this reboot the whole machine.