Hey,

A very common mistake when starting with Linux containers is to expect free and other tools like top to report the container's memory limits.

Here you’ll not only go through why that happens and how to get it right, but also take a look at where the Kernel looks for information when you ask it for memory statistics.

Also, if you’re curious about what the code that keeps track of per-cgroup page counters looks like, stick around until the end!

This is the third article in a series of 30 articles around procfs: A Month of /proc. If you’d like to keep up to date with it, make sure you join the mailing list!

Running top within a container

To get a testbed for the rest of the article, consider the case of running a single container with a memory limit of 10MB in a system that has 2GB of RAM available:

```shell
# Check the amount of memory available
# outside the container (i.e., in the host)
free -h
              total        used        free   available
Mem:           1.9G        312M        385M        1.5G

# Define the total number of bytes that
# will dictate the memory limit of the
# container.
MEM_MAX="$((1024 * 1024 * 10))"

# Run a container using the ubuntu image
# as its base image, with the memory limit
# set to 10MB, and a tty as well as interactive
# support.
docker run \
  --interactive \
  --tty \
  --memory $MEM_MAX \
  ubuntu
```

With the container running, we can now check the results of executing top in there:

```
top -bn1
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
            .----------------.
            |                |
KiB Mem :   | 2040940 total, |  117612 free,   651204 used,  1272124 buff/cache
KiB Swap:   |       0 total, |       0 free,        0 used.  1196972 avail Mem
            *--+-------------*
               |
  PID USER     |  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root     |  20   0   18508   3432   3016 S   0.0  0.2   0:00.02 bash
   12 root     |  20   0   36484   3104   2748 R   0.0  0.2   0:00.00 top
               |
               *---> Not really what we expect, that is 2GB!!
```

As outlined before, this is not what one would typically expect: it reports the total memory as seen by the host, without showing the 10MB limit at all.

What about free ? Same thing:

```
free -h
              total        used        free   available
Mem:           1.9G        612M        131M        1.2G
Swap:            0B          0B          0B
```

If we inspect the syscalls used by both top and free, we can see that they’re making use of plain open(2) and read(2) calls:

```
# Check what are the syscalls being
# used by `free`
strace -f free
...
                                  .-------.
                                  |       v
openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 3
read(3, "MemTotal:        2040940 kB\nMemF"..., 8191) = 1307
...                               |
                                  That is 2GB!

# Check what are the syscalls being used
# by `top`
strace -f top -p 19282 -bn1
...
openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 5
lseek(5, 0, SEEK_SET) = 0
read(5, "MemTotal:        2040940 kB\nMemF"..., 8191) = 1307
...                              ^
                                 |
               2GB again --------*
```

Looking at those return values (what was read), we can spot that the “problem” comes from /proc/meminfo, which free and top are just blindly trusting.

Before we go check what the Kernel is doing when reporting those values, let’s quickly review how a container gets its memory limits set.

Setting container limits

The way that Docker (ok, runc ) ends up setting the container limits is via the use of cgroups .

As very well documented in the man page (see man 7 cgroups):

Control groups, usually referred to as cgroups, are a Linux kernel feature which allows processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.

To see that in action, consider the following program that allocates memory in chunks of 1MB:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE (1 << 20)
#define ALLOCATIONS 20

/**
 * alloc - a "leaky" program that just allocates
 *         a predefined amount of memory and then
 *         exits.
 */
int
main(int argc, char** argv)
{
	printf("allocating: %dMB\n", ALLOCATIONS);

	void* p;
	int   i = ALLOCATIONS;

	while (i-- > 0) {
		// Allocate 1MB (not initializing it
		// though).
		p = malloc(MEGABYTE);
		if (p == NULL) {
			perror("malloc");
			return 1;
		}

		// Explicitly initialize the area that
		// has been allocated.
		memset(p, 65, MEGABYTE);
		printf("remaining\t%d\n", i);
	}

	return 0;
}
```

We can see that, without any limits, the program allocates the whole 20MB without problems:

```
# Keep allocating memory until the 20MB
# mark gets reached.
./alloc.out
allocating: 20MB
remaining       19
remaining       18
...
remaining       1
remaining       0
```

That changes after we put our process under a cgroup with memory limits set:

```shell
# Create our custom cgroup
mkdir /sys/fs/cgroup/memory/custom-group

# Configure the maximum amount of memory
# that all of the processes in such cgroup
# will be able to allocate
echo "$((1024 * 1024 * 10))" > \
        /sys/fs/cgroup/memory/custom-group/memory.limit_in_bytes

# Put the current process tree under such
# cgroup
echo $$ > \
        /sys/fs/cgroup/memory/custom-group/tasks

# Try to allocate the 20MB
./alloc.out
allocating: 20MB
remaining       19
remaining       18
remaining       17
remaining       16
remaining       15
remaining       14
remaining       13
remaining       12
Killed
```

Looking at the results from dmesg , we can see what happened:

```
                                     our thing getting killed!
                                        .------------.
[181346.109904] alloc.out invoked       | oom-killer: |
                                        *------------*
[181346.109906] alloc.out cpuset=/ mems_allowed=0
[181346.109911] CPU: 0 PID: 22074 Comm: alloc.out Not tainted 4.15.0-36-generic #39-Ubuntu
[181346.109911] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[181346.109912] Call Trace:
[181346.109918]  dump_stack+0x63/0x8b
[181346.109920]  dump_header+0x71/0x285
[181346.109923]  oom_kill_process+0x220/0x440
[181346.109924]  out_of_memory+0x2d1/0x4f0
[181346.109926]  mem_cgroup_out_of_memory+0x4b/0x80
[181346.109928]  mem_cgroup_oom_synchronize+0x2e8/0x320
[181346.109930]  ? mem_cgroup_css_online+0x40/0x40
[181346.109931]  pagefault_out_of_memory+0x36/0x7b
[181346.109934]  mm_fault_error+0x90/0x180
[181346.109935]  __do_page_fault+0x4a5/0x4d0
[181346.109937]  do_page_fault+0x2e/0xe0
[181346.109940]  ? page_fault+0x2f/0x50
[181346.109941]  page_fault+0x45/0x50
...
                              Killed!
                 ____________________________
                /                            \
[181346.109950] Task in /custom-group killed as a result of limit of /custom-group
[181346.109954] memory: usage 10240kB, limit 10240kB, failcnt 56
[181346.109954] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[181346.109955] kmem: usage 940kB, limit 9007199254740988kB, failcnt 0
[181346.109955] Memory cgroup stats for /custom-group: cache:0KB rss:9300KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:9248KB inactive_file:0KB active_file:0KB unevictable:0KB
[181346.109965] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[181346.110005] [21530]     0 21530     5837     1381    90112        0             0 bash
[181346.110011] [22074]     0 22074     3440     2594    69632        0             0 alloc.out
[181346.110012] Memory cgroup out of memory: Kill process 22074 (alloc.out) score 989 or sacrifice child
[181346.318942] Killed process 22074 (alloc.out) total-vm:13760kB, anon-rss:8988kB, file-rss:1388kB, shmem-rss:0kB
[181346.322003] oom_reaper: reaped process 22074 (alloc.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

So we can see that the limits are indeed being enforced.

Again, why is /proc telling us that we have 2GB of memory?

Memory limits set by cgroups are not namespaced

The reason is that the memory information retrieved via /proc/meminfo is not namespaced.

Unlike other things, like listing PIDs under /proc, when the file_operations that procfs implements reach the point of gathering memory information, they don’t acquire a namespaced view of it.

For instance, let’s compare how procfs behaves when listing the contents of /proc/ (the directory entries) versus reading /proc/meminfo.

In the case of listing /proc (see How is /proc able to list process IDs?), we can see procfs taking the namespace reference and using it:

```c
int proc_pid_readdir(struct file* file, struct dir_context* ctx)
{
	// Takes the namespace as seen by the file
	// provided.
	struct pid_namespace* ns = file_inode(file)->i_sb->s_fs_info;

	// ...

	// Iterates through the next available tasks
	// (processes) as seen by the namespace that
	// we are within.
	for (iter = next_tgid(ns, iter); iter.task;
	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
		// ...
	}

	// ...
}
```

Meanwhile, in the case of reading /proc/meminfo , that doesn’t happen at all (well, as expected, it’s not about namespaces! It’s about cgroups):

```c
static int meminfo_proc_show(struct seq_file* m, void* v)
{
	struct sysinfo i;

	// ...

	// Populate the sysinfo struct with memory-related
	// stuff
	si_meminfo(&i);

	// Add swap information
	si_swapinfo(&i);

	// ...

	// start displaying
	show_val_kb(m, "MemTotal:       ", i.totalram);
	show_val_kb(m, "MemFree:        ", i.freeram);

	// ...
}
```

As expected, not a single reference to namespaces (or cgroups).

Also, si_meminfo, the method that fills the sysinfo struct with the global values that end up in /proc/meminfo, has no idea about cgroups either:

```c
/**
 * The struct that holds part of the memory information
 * that ends up being displayed in the end.
 */
struct sysinfo {
	__kernel_long_t  uptime;    /* Seconds since boot */
	__kernel_ulong_t loads[3];  /* 1, 5, and 15 minute load averages */
	__kernel_ulong_t totalram;  /* Total usable main memory size */
	__kernel_ulong_t freeram;   /* Available memory size */
	__kernel_ulong_t sharedram; /* Amount of shared memory */
	__kernel_ulong_t bufferram; /* Memory used by buffers */
	__kernel_ulong_t totalswap; /* Total swap space size */
	__kernel_ulong_t freeswap;  /* swap space still available */
	__u16            procs;     /* Number of current processes */
	__u16            pad;       /* Explicit padding for m68k */
	__kernel_ulong_t totalhigh; /* Total high memory size */
	__kernel_ulong_t freehigh;  /* Available high memory size */
	__u32            mem_unit;  /* Memory unit size in bytes */
	char _f[20 - 2 * sizeof(__kernel_ulong_t) -
	        sizeof(__u32)]; /* Padding: libc5 uses this.. */
};

/**
 * Fills the `sysinfo` struct passed as a pointer
 * with values collected from the system (globally
 * set).
 */
void si_meminfo(struct sysinfo* val)
{
	val->totalram  = totalram_pages;
	val->sharedram = global_node_page_state(NR_SHMEM);
	val->freeram   = global_zone_page_state(NR_FREE_PAGES);
	val->bufferram = nr_blockdev_pages();
	val->totalhigh = totalhigh_pages;
	val->freehigh  = nr_free_highpages();
	val->mem_unit  = PAGE_SIZE;
}
```

Interesting fact: totalram_pages (reported as MemTotal) can change - see this StackOverflow question: Why does MemTotal in /proc/meminfo change?.

Who’s controlling the allocation of memory?

If you’re now wondering where that limit we set in the cgroup ends up being enforced, we need to look at the path that a memory allocation takes:

```
alloc.out (our process)
  |
  |
  *--> task_struct (process descriptor)
        |
        |
        *--> mm_struct (memory descriptor)
              |
              |
              *--> mem_cgroup
                    |
                    +------> page_counter memory
                    |          |
                    |          *--> { atomic_long_t count,
                    |                 unsigned long limit }
                    |
                    *------> page_counter swap
```

Within the Kernel, each process created (in our case, alloc.out ) is referenced internally via a process descriptor task_struct :

```c
struct task_struct {
	struct thread_info thread_info;

	// ...

	unsigned int cpu;

	struct mm_struct* mm;

	// ...
};
```

Such process descriptor references a memory descriptor mm defined as mm_struct :

```c
struct mm_struct {
	struct vm_area_struct* mmap; /* list of VMAs */
	unsigned long mmap_base;     /* base of mmap area */
	unsigned long task_size;     /* size of task vm space */

	// ...

#ifdef CONFIG_MEMCG
	struct mem_cgroup* mem_cgroup;
#endif
};
```

Such memory descriptor references a mem_cgroup , a data structure that keeps track of the cgroup semantics for memory limiting and accounting:

```c
struct mem_cgroup {
	struct cgroup_subsys_state css;

	/* Private memcg ID. Used to ID objects that outlive the cgroup */
	struct mem_cgroup_id id;

	/* Accounted resources */
	struct page_counter memory;
	struct page_counter swap;

	// ...
};
```

Such cgroup data structure then references some page counters (memory and swap, for instance), defined via the page_counter struct, which are responsible for keeping track of usage and for enforcing the limit when someone tries to acquire a page:

```c
struct page_counter {
	atomic_long_t count;
	unsigned long limit;

	// The parent cgroup (remember, cgroups are
	// hierarchical!)
	struct page_counter* parent;

	// ...
};
```

Whenever a process needs some pages assigned to it, page_counter_try_charge walks up the cgroup memory hierarchy, trying to charge a given number of pages to each counter: in case of success (the new value stays below the limit), it updates the counts; otherwise, it triggers OOM behavior.

Using bcc to trace page_counter_try_charge , we can see how the act of page_fault ing leads to mem_cgroup_try_charge calling page_counter_try_charge :

```
25641  25641  alloc.out  page_counter_try_charge
        page_counter_try_charge+0x1 [kernel]
        mem_cgroup_try_charge+0x93 [kernel]
        handle_pte_fault+0x3e3 [kernel]
        __handle_mm_fault+0x478 [kernel]
        handle_mm_fault+0xb1 [kernel]
        __do_page_fault+0x250 [kernel]
        do_page_fault+0x2e [kernel]
        page_fault+0x45 [kernel]
```

Tracing a cgroup running out of memory

If we’re even more curious and decide to trace the page_counter_try_charge arguments, we can see the charges failing when we’re within a container and try to grab more memory than we’re allowed.

Using bpftrace, we’re able to tailor a small program that inspects the page_counter used by page_counter_try_charge and see how the count approaches the limit over time (until the point of exhaustion, when we receive an OOM kill).

```
#include <linux/page_counter.h>

BEGIN
{
	printf("Tracing page_counter_try_charge... Hit Ctrl-C to end.\n");
	printf("%-8s %-6s %-16s %-10s %-10s %-10s\n",
	       "TIME", "PID", "COMM", "REQUESTED", "CURRENT", "LIMIT");
	@epoch = nsecs;
}

kprobe:page_counter_try_charge
{
	$pcounter = (page_counter*)arg0;

	$limit = $pcounter->limit;
	$current = $pcounter->count.counter;
	$requested = arg1;

	printf("%-8d %-6d %-16s %-10ld %-10ld %-10ld\n",
	       (nsecs - @epoch) / 1000000,
	       pid, comm, $requested, $current, $limit);
}
```

Running the tracer with a shell session put into the cgroup that limits our memory, we can see it running out of pages:

```
sudo bpftrace ./try-charge-counter.d
Attaching 2 probes...
Tracing page_counter_try_charge... Hit Ctrl-C to end.
TIME     PID    REQUESTED  CURRENT    LIMIT
...
3301     25980  32         1288       2560
3302     25980  32         1320       2560
...
3307     25980  1          2553       2560
3307     25980  32         2554       2560
               .---------------------.
3307     25980 | 1          2554     |  2560
3308     25980 | 32         2555     |  2560
3308     25980 | 1          2555     |  2560
3308     25980 | 32         2556     |  2560
3308     25980 | 1          2556     |  2560
3308     25980 | 32         2557     |  2560
3308     25980 | 1          2557     |  2560
3308     25980 | 32         2558     |  2560
               *----------.----------*
                          |
                          still possible to increase
                          the number of pages
...
3308     25980  1          2558       2560
3308     25980  32         2559       2560
3308     25980  1          2559       2560
3308     25980  32         2560       2560   * LIMIT REACHED
3308     25980  1          2560       2560   *
3308     25980  1          2560       2560   *
                                             |
                 Whoopsy, can't allocate <---*
                 anymore!
```

Closing thoughts

Although I’d understood that meminfo wasn’t namespaced, it wasn’t clear to me why.

Going through the exercise of tailoring a quick program to inspect the arguments passed to page_counter_try_charge was very interesting (and easier than I thought!).

Shout out to bpftrace once again for allowing us to go deep into the Kernel with ease!

If you have any further questions, or just want to connect, let me know! I’m cirowrc on Twitter.

Have a good one!

Resources