Listing 29: <<capabilities>> += int capabilities() { fprintf(stderr, "=> dropping capabilities...");

CAP_AUDIT_CONTROL, CAP_AUDIT_READ, and CAP_AUDIT_WRITE allow access to the kernel's audit system (i.e. functions like audit_set_enabled, usually used through auditctl). The kernel rejects messages that normally require CAP_AUDIT_CONTROL outside of the first pid namespace, but it does allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE from any namespace. So let's drop them all. We especially want to drop CAP_AUDIT_READ, since the audit log isn't namespaced and may contain important information, but CAP_AUDIT_WRITE may also allow the contained process to falsify logs or DoS the audit system.

Listing 32: <<capabilities>> += int drop_caps[] = { CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_AUDIT_WRITE,
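As an aside (this is not part of <<capabilities>>), it can help to check which capabilities are actually in a process's bounding set before and after dropping them. Here is a minimal sketch, assuming a Linux /proc filesystem; it parses the CapBnd line of /proc/self/status and tests single bits. The helper names cap_in_mask and read_capbnd are made up for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <linux/capability.h>   /* CAP_* numbers */

/* Hypothetical helper: is capability number `cap` set in a 64-bit mask? */
int cap_in_mask(uint64_t mask, int cap)
{
    if (cap < 0 || cap > 63)
        return 0;
    return (int)((mask >> cap) & 1u);
}

/* Hypothetical helper: read the bounding-set mask of the calling process
 * from the "CapBnd:" line of /proc/self/status.
 * Returns 0 on success, -1 if the line could not be found or parsed. */
int read_capbnd(uint64_t *mask)
{
    char line[256];
    int found = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "CapBnd:", 7) == 0 &&
            sscanf(line + 7, "%" SCNx64, mask) == 1) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

After a successful read_capbnd(&mask), something like cap_in_mask(mask, CAP_AUDIT_READ) should report whether that capability is still available to the process and its children.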

CAP_BLOCK_SUSPEND lets programs prevent the system from suspending, either with EPOLLWAKEUP or /proc/sys/wake_lock. Suspend isn't namespaced, so we'd like to prevent this.

Listing 34: <<capabilities>> += CAP_BLOCK_SUSPEND,

CAP_DAC_READ_SEARCH lets programs call open_by_handle_at with an arbitrary struct file_handle *. In theory struct file_handle is an opaque type, but in practice it corresponds to inode numbers, so it's easy to brute-force handles and read arbitrary files. This was used by Sebastian Krahmer to write a program to read arbitrary system files from within Docker in 2014.

Listing 36: <<capabilities>> += CAP_DAC_READ_SEARCH,

CAP_FSETID, without user namespacing, allows the process to modify a setuid executable without removing the setuid bit. This is pretty dangerous! It means that if we include a setuid binary in a container, it's easy for us to accidentally leave a dangerous setuid root binary on our disk, which any user can use to escalate privileges.

Listing 40: <<capabilities>> += CAP_FSETID,
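To see the behavior CAP_FSETID controls, here is a small standalone sketch (not part of <<capabilities>>). It creates a scratch file under /tmp, sets the setuid bit, writes a byte, and checks whether the bit is still there. The function name setuid_survives_write and the scratch path are assumptions for illustration. I'd expect the kernel to clear the bit for a process without CAP_FSETID and to keep it for a process that has it, but that depends on how the process is run.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch setuid file, write to it, and
 * report whether the setuid bit survived the write.
 * Returns 1 if the bit is still set, 0 if it was cleared, -1 on error. */
int setuid_survives_write(void)
{
    char path[] = "/tmp/fsetid-XXXXXX";   /* scratch file, removed below */
    struct stat st;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    if (fchmod(fd, S_ISUID | 0755) == 0 &&
        write(fd, "x", 1) == 1 &&
        fstat(fd, &st) == 0)
        result = (st.st_mode & S_ISUID) ? 1 : 0;
    close(fd);
    unlink(path);
    return result;
}
```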

CAP_IPC_LOCK can be used to lock more of a process' own memory than would normally be allowed, which could be a way to deny service.

Listing 43: <<capabilities>> += CAP_IPC_LOCK,
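The limit that CAP_IPC_LOCK lets a process exceed is RLIMIT_MEMLOCK. As a rough illustration (separate from <<capabilities>>), this sketch reads the soft limit that applies once the capability is gone; memlock_limit is a hypothetical name.

```c
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft RLIMIT_MEMLOCK limit, in bytes.
 * Returns 0 on success and stores the limit in *soft, -1 on error.
 * The value may be RLIM_INFINITY if no limit is configured. */
int memlock_limit(rlim_t *soft)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    return 0;
}
```

Without CAP_IPC_LOCK, calls such as mlock should fail once the process tries to lock more than this many bytes.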

CAP_MAC_ADMIN and CAP_MAC_OVERRIDE are used by the mandatory access control systems AppArmor, SELinux, and SMACK to restrict access to their settings. These aren't namespaced, so they could be used by the contained programs to circumvent system-wide access control.

Listing 44: <<capabilities>> += CAP_MAC_ADMIN, CAP_MAC_OVERRIDE,

CAP_MKNOD, without user namespacing, allows programs to create device files corresponding to real-world devices. This includes creating new device files for existing hardware. If this capability were not dropped, a contained process could re-create the hard disk device, remount it, and read or write to it.

Listing 47: <<capabilities>> += CAP_MKNOD,

I was worried that CAP_SETFCAP could be used to add a capability to an executable and execve it, but it's not actually possible for a process to set capabilities it doesn't have. But! An executable altered this way could be executed by any unsandboxed user, so I think it unacceptably undermines the security of the system.

Listing 51: <<capabilities>> += CAP_SETFCAP,

CAP_SYSLOG lets users perform destructive actions against the syslog. Importantly, dropping it doesn't prevent contained processes from reading the syslog, which could be risky. The capability also exposes kernel addresses, which could be used to circumvent kernel address layout randomization.

Listing 54: <<capabilities>> += CAP_SYSLOG,
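Whether an unprivileged process can still read the kernel log, and whether kernel addresses are hidden from it, is governed by the kernel.dmesg_restrict and kernel.kptr_restrict sysctls. A minimal sketch for inspecting them, separate from <<capabilities>>; read_sysctl_int is a hypothetical helper name.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a /proc/sys file.
 * Returns the value, or -1 if the file can't be opened or parsed. */
int read_sysctl_int(const char *path)
{
    int value = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, read_sysctl_int("/proc/sys/kernel/dmesg_restrict") returning 0 would mean that reading the log doesn't require CAP_SYSLOG at all, so dropping the capability isn't enough on its own to hide the log from the contained process.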

CAP_SYS_ADMIN allows many behaviors! We don't want most of them (mount, vm86, etc.). Some would be nice to have (sethostname, mount for bind mounts…) but the extra complexity doesn't seem worth it.

Listing 55: <<capabilities>> += CAP_SYS_ADMIN,

CAP_SYS_BOOT allows programs to restart the system (the reboot syscall) and load new kernels (the kexec_load and kexec_file_load syscalls). We absolutely don't want this. reboot is user-namespaced, and the kexec* functions only work in the root user namespace, but neither of those helps us.

Listing 59: <<capabilities>> += CAP_SYS_BOOT,

CAP_SYS_MODULE is used by the syscalls delete_module, init_module, and finit_module, by the code for kmod, and by the code for loading device modules with ioctl.

Listing 66: <<capabilities>> += CAP_SYS_MODULE,

CAP_SYS_NICE allows processes to set higher priority on given pids than the default. The default kernel scheduler doesn't know anything about pid namespaces, so it's possible for a contained process to deny service to the rest of the system.

Listing 71: <<capabilities>> += CAP_SYS_NICE,
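A quick way to check that CAP_SYS_NICE is really gone is to try to raise the process's own priority. This is only a sketch, separate from <<capabilities>>, and try_raise_priority is a hypothetical name. Note that on success it really does lower the caller's nice value by one.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: try to lower the calling process's nice value by
 * one (i.e. raise its priority). Returns 0 on success, otherwise the
 * errno value reported by the kernel. */
int try_raise_priority(void)
{
    int cur;

    errno = 0;
    cur = getpriority(PRIO_PROCESS, 0);
    if (cur == -1 && errno != 0)   /* -1 is also a legal nice value */
        return errno;
    if (setpriority(PRIO_PROCESS, 0, cur - 1) != 0)
        return errno;
    return 0;
}
```

Without CAP_SYS_NICE (and with the default RLIMIT_NICE) I'd expect this to return EACCES.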

CAP_SYS_RAWIO allows full access to the host system's memory with /proc/kcore, /dev/mem, and /dev/kmem, but a contained process would need mknod to access these within the namespace. But it also allows things like iopl and ioperm, which give raw access to the IO ports.

Listing 76: <<capabilities>> += CAP_SYS_RAWIO,

CAP_SYS_RESOURCE specifically allows circumventing kernel-wide limits, so we probably should drop it. But I don't think this can do more than DoS the kernel, in general.

Listing 78: <<capabilities>> += CAP_SYS_RESOURCE,

CAP_SYS_TIME: setting the time isn't namespaced, so we should prevent contained processes from altering the system-wide time.

Listing 79: <<capabilities>> += CAP_SYS_TIME,

CAP_WAKE_ALARM, like CAP_BLOCK_SUSPEND, lets the contained process interfere with suspend, and we'd like to prevent that.

Listing 81: <<capabilities>> += CAP_WAKE_ALARM };
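With the list complete, one way it could be applied is to remove each entry from the bounding set with prctl(PR_CAPBSET_DROP). The following is a standalone sketch under that assumption, not necessarily how the rest of this program does it; drop_bounding_caps is a hypothetical helper, and the call needs CAP_SETPCAP to succeed.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/capability.h>   /* CAP_* numbers for callers */

/* Hypothetical helper: drop each listed capability from the calling
 * thread's bounding set. Returns how many entries could NOT be dropped
 * (0 means every drop succeeded). */
size_t drop_bounding_caps(const int *caps, size_t n)
{
    size_t failed = 0;

    for (size_t i = 0; i < n; i++) {
        if (prctl(PR_CAPBSET_DROP, (unsigned long)caps[i], 0UL, 0UL, 0UL) != 0)
            failed++;
    }
    return failed;
}
```

A caller might pass the array built above together with its element count, for example drop_bounding_caps(drop_caps, sizeof(drop_caps) / sizeof(*drop_caps)), and treat any nonzero result as fatal.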