gVisor

Someone pointed me at this press release from Google announcing a Docker / container sandbox for Linux.

I was intrigued enough to write a ‘quick look’ article on it here

What Does That Mean (tl;dr)?

It’s a way of achieving:

VM-like isolation while

using containers for app deployment and achieving

multi-tenancy, and

SELinux/Apparmor/Seccomp security control

What Does That Mean (Longer)?

There’s quite a few ways to limit the access a container has to the OS API. They’re listed and discussed on the gVisor Github page.

It explains what it does to the container:

gVisor intercepts all system calls made by the application, and does the necessary work to service them.

At first I thought gVisor was a ‘just’ syscall intermediary that could filter Linux API calls, but the sandboxing goes further than that:

Importantly, gVisor does not simply redirect application system calls through to the host kernel. Instead, gVisor implements most kernel primitives (signals, file systems, futexes, pipes, mm, etc.) and has complete system call handlers built on top of these primitives.

From a security perspective, this is a key quote as well:

Since gVisor is itself a user-space application, it will make some host system calls to support its operation, but much like a VMM, it will not allow the application to directly control the system calls it makes.

What really made my jaw drop was this:

[gVisor’s] Sentry [process] implements its own network stack (also written in Go) called netstack.

So it even goes to the length of not touching the host’s network stack. This reduces the attack surface of a malign container significantly.

From reading this it seems to implement most of an OS in userspace, only going to the host OS when necessary and allowed.

This is in contrast to tools like SELinux and AppArmor, which rely on host Kernel features and a bunch of root-defined constraining rules to ensure nothing bad happens.

SELinux is a fantastic technology that should be more used, but the reality is that it’s very hard for people to write and understand policies and feel comfortable with it since it’s so embedded in the kernel.

An alternative to achieve a similar thing might be to just use a VM:

But as the number of boxes above indicates, that’s a relatively heavy overhead to achieve isolation.

gVisor gives you the lightweight benefits of containers and the control of VMM and host-based kernel filters.

A Closer Look

I wrote a ShutIt script to create a re-usable Ubuntu VM in which I could build gVisor and run up sandboxed containers.

The build part of the script is here and the instructions to reproduce are here. If you get stuck contact me.

Here’s a video of the whole thing setting up in a VM using the above script:

Architecture

gVisor is a go binary that creates a runtime environment for the container instead of runc . It consists of two processes:

In order to provide defense-in-depth and limit the host system surface, the gVisor container runtime is normally split into two separate processes. First, the Sentry process includes the kernel and is responsible for executing user code and handling system calls. Second, file system operations that extend beyond the sandbox (not internal proc or tmp files, pipes, etc.) are sent to a proxy, called a Gofer, via a 9P connection.

I didn’t know what a 9P connection is. I assume it’s something to do with the Plan9 OS, but that’s just a guess.

You might also like these posts:

Docker Security Validation

A Field Guide to Docker Security Measures

SELinux Experimentation with Reduced Pain

Unprivileged Docker Builds – A Proof of Concept

If you set the Docker daemon up according to the docs, you get a set of debug files in /tmp/runsc :

-rw-r--r-- 1 root root 6435 May 5 07:53 runsc.log.20180505-075350.302600.create -rw-r--r-- 1 root root 1862 May 5 07:53 runsc.log.20180505-075350.337120.state -rw-r--r-- 1 root root 3180 May 5 07:53 runsc.log.20180505-075350.346384.start -rw-r--r-- 1 root root 1862 May 5 07:53 runsc.log.20180505-075350.529798.state -rw-r--r-- 1 root root 32705613 May 5 08:22 runsc.log.20180505-075350.312537.gofer -rw-r--r-- 1 root root 226843210 May 5 08:22 runsc.log.20180505-075350.319600.boot -rw-r--r-- 1 root root 1639 May 5 08:22 runsc.log.20180505-082250.158154.kill -rw-r--r-- 1 root root 1858 May 5 08:22 runsc.log.20180505-082250.210046.state -rw-r--r-- 1 root root 1639 May 5 08:22 runsc.log.20180505-082250.221802.kill -rw-r--r-- 1 root root 1600 May 5 08:22 runsc.log.20180505-082250.233557.delete

The interesting ones appear to be .gofer (which records calls made to the OS). When I noodled around, these mostly appeared to be requests to write to the docker filesystem on the host (which needs to happen when you write in the container):

D0505 07:53:50.516882 10831 x:0] Open reusing control file, mode: ReadOnly, "/var/lib/docker/overlay2/a8eadcb9a8427fa170e485f72d5aee6ee85a9c7b9176a6f01a6965f2bcd7e219/merged/bin/bash" D0505 07:53:50.516907 10831 x:0] send [FD 3] [Tag 000001] Rlopen{QID: QID{Type: 0, Version: 0, Path: 541783}, IoUnit: 0, File: &{{38}}}

or files

D0505 07:53:50.518729 10831 x:0] send [FD 3] [Tag 000001] Rreadlink{Target: /lib/x86_64-linux-gnu/ld-2.27.so} D0505 07:53:50.518869 10831 x:0] recv [FD 3] [Tag 000001] Twalkgetattr{FID: 1, NewFID: 15, Names: [lib]} D0505 07:53:50.518927 10831 x:0] send [FD 3] [Tag 000001] Rwalkgetattr{Valid: AttrMask{with: Mode NLink UID GID RDev ATime MTime CTime Size Blocks}, Attr: Attr{Mode: 0o40755, UID: 0, GID: 0, NLink: 8, RDev: 0, Size: 4096, BlockSize: 4096, Blocks: 8, ATime: {Sec: 1525506464, NanoSec: 357515221}, MTime: {Sec: 1524777373, NanoSec: 0}, CTime: {Sec: 1525506464, NanoSec: 345515221}, BTime: {Sec: 0, NanoSec: 0}, Gen: 0, DataVersion: 0}, QIDs: [QID{Type: 128, Version: 0, Path: 542038}]}

The .boot file is the strace log from the container, which combined with the .gofer log can tell you what’s going on in and out of the container’s userspace.

Matching the above time of the opening of the file up I see this in the .boot log:

D0505 07:53:50.518797 10835 x:0] recv [FD 4] [Tag 000001] Rreadlink{Target: /lib/x86_64-linux-gnu/ld-2.27.so} D0505 07:53:50.518824 10835 x:0] send [FD 4] [Tag 000001] Twalkgetattr{FID: 1, NewFID: 15, Names: [lib]} D0505 07:53:50.519041 10835 x:0] recv [FD 4] [Tag 000001] Rwalkgetattr{Valid: AttrMask{with: Mode NLink UID GID RDev ATime MTime CTime Size Blocks}, Attr: Attr{Mode: 0o40755, UID: 0, GID: 0, NLink: 8, RDev: 0, Size: 4096, BlockSize: 4096, Blocks: 8, ATime: {Sec: 1525506464, NanoSec: 357515221}, MTime: {Sec: 1524777373, NanoSec: 0}, CTime: {Sec: 1525506464, NanoSec: 345515221}, BTime: {Sec: 0, NanoSec: 0}, Gen: 0, DataVersion: 0}, QIDs: [QID{Type: 128, Version: 0, Path: 542038}]}

Boot also has intriguing stuff like this in it:

D0505 07:53:50.514800 10835 x:0] urpc: unmarshal success. W0505 07:53:50.514848 10835 x:0] *** SECCOMP WARNING: console is enabled: syscall filters less restrictive! I0505 07:53:50.514881 10835 x:0] Installing seccomp filters for 63 syscalls (kill=false) I0505 07:53:50.514890 10835 x:0] syscall filter: 0 I0505 07:53:50.514901 10835 x:0] syscall filter: 1 I0505 07:53:50.514916 10835 x:0] syscall filter: 3 I0505 07:53:50.514928 10835 x:0] syscall filter: 5 I0505 07:53:50.514952 10835 x:0] syscall filter: 7

Still Very Beta

I had lots of problems doing basic things in this sandbox, so believe them when they say this is a work in progress.

For example, I ran an apt install and got this error:

E: Can not write log (Is /dev/pts mounted?) - posix_openpt (2: No such file or directory)

which I’d never seen before.

Also, when I pinged:

root@a2e899f2e8af:/# ping google.com root@a2e899f2e8af:/# ping bbc.co.uk root@a2e899f2e8af:/#

It returned immediately but I got no output at all.

I also saw errors with simple commands when apt install ing. Running these commands by hand, I got some kind of race condition that couldn’t be escaped: