Seccomp in Kubernetes — Part I: 7 things you should know before you even start!

The first of a series on how to land great seccomp profiles in a SecDevOpsy way without resorting to magic or sorcery. In the first part I will cover the basics and the internals of the Kubernetes seccomp implementation.

The Kubernetes ecosystem has quite a few security features to keep your containers safe and isolated. Here I will be talking about the Secure Computing Mode (a.k.a. seccomp) feature, which focuses on limiting which system calls your containers are able to execute.

Why is this important? Well, a container is actually just a process running on a given machine, sharing the kernel with all the other applications. If all containers had the ability to make any system call, it would not take long for malicious programs to bypass the container isolation and impact other applications: eavesdrop on information, change system-level settings, etc.

Your seccomp profile defines which system calls should be allowed or blocked, and the container runtime will apply it at container start time so the kernel can enforce it. Once applied, you are effectively decreasing your attack surface and limiting the damage in case anything inside your container (i.e. your dependencies, or their dependencies…) starts doing something it should not be allowed to.

Getting the basics out of the way

A basic seccomp profile has three key elements: the defaultAction, the architectures (or archMap) and the syscalls:

The defaultAction defines what will happen by default to any system call that is not listed in the syscalls section. To make it simple, let's focus on the two main values you will use: SCMP_ACT_ERRNO, which blocks the execution of the system call, and SCMP_ACT_ALLOW, which does what it says on the tin.

The architectures element defines which architectures you are targeting. This is important because the actual filter that is applied at kernel level is based on system call IDs and not on the names you define in your profile. The container runtime will translate the names into IDs just before applying the filter, and the same system call may have a different ID depending on the architecture it runs on. For example, the system call recvfrom, which is used to receive information from a socket, is ID 45 on x86-64 systems while it is ID 517 under the x32 ABI (and 371 on 32-bit x86). Here's a list of all system calls for x86-x64.

syscalls is where you list all the system calls and the action associated with them. For example, you can create a whitelist by setting the defaultAction to SCMP_ACT_ERRNO and the action within your syscalls section to SCMP_ACT_ALLOW. This way you are whitelisting all the calls you enumerated and blocking everything else. For a blacklist approach, swap the values of defaultAction and action.
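As a minimal sketch of how the three elements fit together, the Python snippet below builds a whitelist and a blacklist profile and prints the first one as JSON. The rule layout (a names list plus an action) follows the JSON profile format accepted by Docker and runC; the build_profile helper, the architecture and the syscall names are illustrative assumptions, not a recommended list.

```python
import json

def build_profile(default_action, rule_action, syscall_names,
                  architectures=("SCMP_ARCH_X86_64",)):
    """Return a seccomp profile dict with the three key elements."""
    return {
        "defaultAction": default_action,
        "architectures": list(architectures),
        "syscalls": [
            {"names": sorted(set(syscall_names)), "action": rule_action},
        ],
    }

# Whitelist: block by default, allow only what is enumerated.
# The syscall names are placeholders, not a complete list for any real app.
whitelist = build_profile("SCMP_ACT_ERRNO", "SCMP_ACT_ALLOW",
                          ["read", "write", "exit_group"])

# Blacklist: swap the two actions.
blacklist = build_profile("SCMP_ACT_ALLOW", "SCMP_ACT_ERRNO", ["ptrace"])

print(json.dumps(whitelist, indent=2))
```

Saving the printed JSON to a file gives you something you can reference from a pod as a localhost profile.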

Now let's switch gears and get through the bits that may not be that obvious. But first, note that the recommendations below assume that you are deploying line of business applications into Kubernetes, and that running with least privileges is important to you.

1. AllowPrivilegeEscalation=false

In the container's SecurityContext there is a setting called AllowPrivilegeEscalation. When this is set to false your containers will be running with the no_new_privs bit on. This effectively does what it says on the tin: it blocks the container from spawning new processes with more privileges than itself.

Another side effect when this setting is true (which is the default) is that the container runtime will apply your seccomp profile very early in the container starting process. So all the syscalls required for the runtime internal processes to run, such as setting the container user/group ids and dropping capabilities, will have to be whitelisted in your profile.

So for a container that simply executes echo hi you would need this:

instead of this:

But then again, why is this a problem? Well, I personally would avoid whitelisting these syscalls if I am not using them: capset, set_tid_address, setgid, setgroups and setuid. However, the real problem is that by having to whitelist syscalls for processes you have absolutely no control over, you tie your profiles to the container runtime implementation. Meaning, you (or most probably your cloud provider) update your container runtime and all of a sudden your containers can no longer start.

Pro tip #1: Run your containers with AllowPrivilegeEscalation=false. It will make your seccomp profiles smaller and less likely to be impacted by container runtime changes.
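To make the setting concrete, here is a small sketch that builds a pod manifest for the echo hi case with allowPrivilegeEscalation turned off and prints it as JSON (kubectl accepts JSON manifests as well as YAML). The pod name, the busybox image and the echo_pod helper are assumptions made for illustration only.

```python
import json

def echo_pod(name="echo-hi", image="busybox"):
    """Build a pod manifest whose container cannot gain new privileges."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "command": ["echo", "hi"],
                # Makes the runtime start the container with no_new_privs set.
                "securityContext": {"allowPrivilegeEscalation": False},
            }],
        },
    }

manifest = echo_pod()
print(json.dumps(manifest, indent=2))
```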

2. Setting seccomp profiles at Container Level

When setting a seccomp profile you have the option to set it at pod level:

annotations:
  seccomp.security.alpha.kubernetes.io/pod: "localhost/profile.json"

or at container level:

annotations:
  container.seccomp.security.alpha.kubernetes.io/<container-name>: "localhost/profile.json"

Please note that the syntax above will change when Kubernetes seccomp becomes GA.

One thing that is not well known is that Kubernetes has historically had a bug that forces pod-level seccomp profiles to be applied to the pause container as well. Although it is abstracted away by the runtime, your pods do have that container, as that's what is used to set up the pod infrastructure.

The problem is that this container is always executed with AllowPrivilegeEscalation=true, leading to the same issue we discussed in point 1, and there is no way for you to change that.

By setting your seccomp at container level you avoid that trap and will be able to create a profile that focuses mostly on your container. That will be the case until the bug is fixed and a new version (maybe 1.18?) is widely available to users.

Pro tip #2: Set your seccomp profiles at container level.
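If you generate your manifests programmatically, the container-level annotations can be derived from a simple mapping. The sketch below assumes the alpha annotation key container.seccomp.security.alpha.kubernetes.io/ followed by the container name; the container name api and the helper function are hypothetical.

```python
SECCOMP_CONTAINER_PREFIX = "container.seccomp.security.alpha.kubernetes.io/"

def container_seccomp_annotations(profiles):
    """Map {container name: profile file} to container-level annotations."""
    return {
        SECCOMP_CONTAINER_PREFIX + name: "localhost/" + profile
        for name, profile in profiles.items()
    }

# "api" is a hypothetical container name; profile.json must exist on the node.
annotations = container_seccomp_annotations({"api": "profile.json"})
print(annotations)
```

The resulting dictionary goes under metadata.annotations of the pod, with one entry per container.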

As a rule of thumb, this point is generally a very good answer to the question: "Why does my seccomp profile work with docker run but not when deployed into a Kubernetes cluster?"

3. Use runtime/default as a last resort

Kubernetes currently has two options for built-in profiles: runtime/default and docker/default. Both are implemented by the container runtime and not by Kubernetes. Therefore, they may vary depending on which runtime/version you are using.

In other words, by simply changing the runtime, the set of system calls available to your container may change, whether or not your container actually uses them. The Docker implementation is what most runtimes use. If you want to use this profile, make sure you are happy with what it entails.

The profile docker/default has been deprecated since Kubernetes 1.11, so avoid using it.

In my opinion, the runtime/default profile is great for the purpose it was created for: protecting users who run a docker run command on their own machines from having those machines compromised. However, when it comes to line of business applications running in Kubernetes clusters, I would argue that such a profile is way too open and developers should focus on creating profiles that are application (or application type) specific.

Pro tip #3: Create application-specific seccomp profiles. If you can't do that, go for application type seccomp profiles, for example a superset profile that encompasses all your golang web API applications. As a last resort use runtime/default.
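One possible way to build such an application type profile is to take the union of the syscalls each application needs. The sketch below assumes you already have a per-application syscall list; the application names and syscall lists are made up for illustration.

```python
def superset_profile(per_app_syscalls, architectures=("SCMP_ARCH_X86_64",)):
    """Union the syscalls of several similar apps into one whitelist profile."""
    names = set()
    for syscalls in per_app_syscalls.values():
        names.update(syscalls)
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(architectures),
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }

# Hypothetical applications and syscall lists, not real measurements.
apps = {
    "orders-api": ["read", "write", "futex", "epoll_wait"],
    "users-api": ["read", "write", "futex", "accept4"],
}
profile = superset_profile(apps)
print(profile["syscalls"][0]["names"])
```

The trade-off is that every application sharing the profile is allowed a few calls it does not need, which is still far tighter than runtime/default.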

In future posts I will cover how to approach the creation of seccomp profiles in a SecDevOpsy way, automating and testing them through your pipelines. That way you won't have an excuse not to go for app-specific profiles. ;)

4. Unconfined should NOT be an option

One of the things picked up by Kubernetes' first security audit was that seccomp comes disabled by default. This means that, unless you create a PodSecurityPolicy that enables it in your cluster, all pods that do not specify a seccomp profile will automatically run with seccomp=unconfined.

Running in this mode means one less isolation layer to protect your cluster and is advised against by the security community.

Pro tip #4: No container in your cluster should run as seccomp=unconfined, especially in production environments.
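A quick way to spot offenders is to check which containers in a manifest end up without a profile. The sketch below is a simplification: it assumes the alpha annotations are the only source of profiles (no PodSecurityPolicy defaults), that a container-level annotation takes precedence over the pod-level one, and it ignores init containers. The pod and container names are hypothetical.

```python
POD_KEY = "seccomp.security.alpha.kubernetes.io/pod"
CONTAINER_PREFIX = "container.seccomp.security.alpha.kubernetes.io/"

def unconfined_containers(pod):
    """Return the names of containers that would run without a seccomp profile."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    # No pod-level annotation means the pod-wide fallback is unconfined.
    pod_profile = annotations.get(POD_KEY, "unconfined")
    names = []
    for container in pod["spec"]["containers"]:
        # A container-level annotation takes precedence over the pod-level one.
        profile = annotations.get(CONTAINER_PREFIX + container["name"], pod_profile)
        if profile == "unconfined":
            names.append(container["name"])
    return names

pod = {
    "metadata": {
        "name": "demo",
        "annotations": {CONTAINER_PREFIX + "app": "localhost/profile.json"},
    },
    "spec": {"containers": [{"name": "app"}, {"name": "sidecar"}]},
}
print(unconfined_containers(pod))  # prints ['sidecar']
```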

5. "Audit mode"

This point is not unique to Kubernetes, but it falls under "what you should know before you even start". :)

Historically, creating seccomp profiles was a pain and largely based on trial and error. Not that it has changed much, but you simply would not have a way to test it in production environments without risking breaking your application.

Since Linux kernel 4.14 it has been possible to define parts of your profile to run in audit mode, logging into syslog all the system calls you want without blocking them. To do that you can use the action SCMP_ACT_LOG:

SCMP_ACT_LOG: The seccomp filter will have no effect on the thread calling the syscall if it does not match any of the configured seccomp filter rules but the syscall will be logged.

A good strategy for using this would be:

Allow the system calls you know you need. Block the system calls you know you don't need. Log everything else.

A simplistic example would look like this:

But remember, you must block all calls that you know for a fact you won't be using and may potentially harm your cluster. A good source to get a list going is the official docker documentation in which they explain what system calls they blocked in their default profile and why.
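The allow, block and log strategy can be sketched as a profile whose defaultAction is SCMP_ACT_LOG, with one rule for the known-good calls and one for the known-bad ones. This is a hedged illustration rather than a vetted profile: the audit_profile helper and both syscall lists are placeholders that you would replace with your own.

```python
import json

def audit_profile(allowed, blocked, architectures=("SCMP_ARCH_X86_64",)):
    """Allow known-good calls, block known-bad ones, log everything else."""
    overlap = set(allowed) & set(blocked)
    if overlap:
        raise ValueError("syscalls both allowed and blocked: %s" % sorted(overlap))
    return {
        # Anything not matched by a rule below is logged, not blocked.
        "defaultAction": "SCMP_ACT_LOG",
        "architectures": list(architectures),
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
            {"names": sorted(set(blocked)), "action": "SCMP_ACT_ERRNO"},
        ],
    }

# Placeholder lists: replace them with what your application really needs.
profile = audit_profile(
    allowed=["read", "write", "exit_group"],
    blocked=["mount", "reboot", "kexec_load"],
)
print(json.dumps(profile, indent=2))
```

Once the logs stop showing new system calls, the logged ones can be moved into the allow list and the defaultAction switched to SCMP_ACT_ERRNO.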

But there is a catch! Although SCMP_ACT_LOG has been supported by the kernel since the end of 2017, it only recently made it to the Kubernetes ecosystem. So to use this, you will need to be running at least Linux kernel 4.14 and runC version v1.0.0-rc9.

Pro tip #5: Create audit mode profiles to test in production by mixing a blacklist with a whitelist and logging all exceptions.

6. Prefer whitelists

Whitelists add extra effort to this process, as you need to identify every single system call your application may make in order to land your profile, but they do add that extra safety layer:

It is strongly recommended to use a whitelisting approach whenever possible because such an approach is more robust and simple. A blacklist will have to be updated whenever a potentially dangerous system call is added (or a dangerous flag or option if those are blacklisted), and it is often possible to alter the representation of a value without altering its meaning, leading to a blacklist bypass.

For Go applications I have developed a tool that goes through the execution path and pulls out all the system calls made. For the application below:

By running gosystract against it:

Here’s what you get:

"sched_yield",
"futex",
"write",
"mmap",
"exit_group",
"madvise",
"rt_sigprocmask",
"getpid",
"gettid",
"tgkill",
"rt_sigaction",
"read",
"getpgrp",
"arch_prctl",

I will cover more on tooling later on; this is just a start. :)

Pro tip #6: Allow the system calls you know you need, block everything else.

7. Get the basics right or face unexpected behaviours

Whatever you define in your seccomp profile, the kernel will enforce it, even if that is not what you want. For example, if you block access to calls such as exit or exit_group, your container may not be able to exit, and something as simple as an "echo hi" could trap the container in an exit loop indefinitely, leading to high CPU usage in your cluster:

In these cases, strace can come in handy and will show you what the problem may be:

sudo strace -c -p 9331

Make sure your profiles are a good reflection of all system calls that may be in your application execution path.

Pro tip #7: Be comprehensive and make sure all the basic system calls have been whitelisted.
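A cheap safeguard is to lint a profile before shipping it. The sketch below checks that a handful of essential calls are not blocked; the essentials tuple only contains the two calls mentioned above and is an assumption you should extend. It also ignores argument-level conditions and assumes each syscall appears in at most one rule.

```python
# Actions that let the syscall go through (logging does not block it).
PERMITTING_ACTIONS = {"SCMP_ACT_ALLOW", "SCMP_ACT_LOG"}

def effective_action(profile, syscall):
    """Return the action of the first rule naming the syscall, else the default."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]

def missing_essentials(profile, essentials=("exit", "exit_group")):
    """List the essential syscalls that the profile would block."""
    return [name for name in essentials
            if effective_action(profile, name) not in PERMITTING_ACTIONS]

profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["write", "exit_group"], "action": "SCMP_ACT_ALLOW"}],
}
print(missing_essentials(profile))  # prints ['exit']
```

Running a check like this in your pipeline catches the exit loop scenario before it reaches a cluster.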