Key Takeaways

Near the close of 2018, Amazon amped up an already-increasing interest level in the marrying of container and hypervisor technology by announcing Firecracker, a Rust-based Virtual Machine Monitor.

Around the midpoint of this year, Weaveworks introduced a new project, Ignite, which wraps Amazon’s Firecracker project with a container-lifecycle UI experience.

One clear use case for this new marriage of hypervisor technology and containers is to add a layer of protection and isolation to potentially sensitive (or untrusted) workloads.

User experience is also important within the hypervisor/container space: a growing number of companies, developers, and open source projects have become dependent on the simple life-cycle and user experience that Docker brought to the container world.

Providing “isolation flexibility” within container orchestration frameworks like Kubernetes is becoming increasingly important. Can and will we have a world where clusters can provide dynamic choice based on workload type? The answer is yes!



The 2019 news cycle here in our cloud native corner of the world has been abuzz with a word previously thought outmoded by the rapid rise of containers: "hypervisor." Near the close of 2018, Amazon amped up an already-increasing interest level in the marrying of container and hypervisor technology by announcing Firecracker, a Rust-based Virtual Machine Monitor (VMM) with an associated open source project, as well as integration with the CNCF containerd runtime project. This news was quickly followed by an announcement from the Kata Containers project — an existing open source virtualization-backed container runtime: they would be supporting Firecracker-backed virtualization in addition to their traditional qemu/kvm-based runtime.

As we rounded the corner into this year, blog posts and 2019 predictions included expectations that this might be the "year of the hypervisor." Both The New Stack and InfoQ had great roundups on the topic, distilling information from both Paul Czarkowski’s "The Future of Kubernetes is VMs" blog post and Chris Short’s "Kubernetes Will Start to Replace the Hypervisor" section of his 2018 close-out and 2019 predictions post.

Around the midpoint of this year, Weaveworks and project creator Lucas Käldström introduced a new project, Ignite, which wraps Amazon’s Firecracker project with a container-lifecycle UI experience. Ignite generated a significant amount of interest within days, and even hours, of its announcement, leading to early proofs of concept ranging from an integration with Virtual Kubelet to a community pull request to use Ignite within OpenWhisk, an open source serverless platform built on container-native function execution.

So what does all this mean as we continue with rapid adoption and hyper-ecosystem growth around Kubernetes and containers? Let’s try and break that down into a few key areas and see what all the excitement is about.

Security

At the risk of stating the obvious, the ever-present discussion around security and Linux container isolation has led to strong recommendations for applying "defense in depth" as developers implement container-based architectures. Many of these walls of defense are present within the orchestration and runtime layers we are using today; everything from seccomp profiles, to configured and enabled LSMs (AppArmor and SELinux), along with new focus areas around "rootless" and least privileged/limited capabilities for containerized processes.

To draw a picture that helps us understand the argument for additional isolation layers, let’s think about the shared Linux kernel underneath our containers as a number. We’ll pick the number of system calls available on average in the Linux kernel, which currently is around 340. The Docker engine default seccomp profile blocks 44 system calls today, leaving containers running in this default Docker engine configuration with just around 300 syscalls available. Of course what containers can do with those syscalls is still at the mercy of other layers of defense, like running the container without administrative privilege, removing more Linux capabilities from the process, additional restrictions applied via AppArmor or SELinux, and so on.

Returning to the number of syscalls, let’s describe this as the "attack surface" available to potentially malicious programs in a container running on our host. As a general rule, we should expect that smaller numbers are better, as that means a smaller attack surface available to containers. By that logic, adding a hypervisor between the host and the containerized process means effectively reducing that number to zero, although of course the software implementing the hypervisor (for example, a KVM-based virtualizer would have complete access to "/dev/kvm" and the hardware’s virtualization capabilities) will be using system calls on the host to implement these features. However, the code we run in the container should have no access through this VM barrier back to our protected host. Stepping out of our discussion of purely lightweight virtualization-based isolators for a moment, we observe that Google’s gVisor and IBM Research’s Nabla projects are also reducing this "300-or-so" syscall number to single digits. In the case of Nabla this number is specifically seven, and those interested in further investigation can look at this measurements project to see how to compare and determine the use of syscalls for various isolators.
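The arithmetic behind this "attack surface as a number" framing can be sketched in a few lines. This is only an illustration using the approximate figures quoted above (340 syscalls, 44 blocked by the default Docker seccomp profile, seven used by Nabla); real counts vary by kernel version and architecture, and the function and variable names are made up for this example.

```python
# Approximate figures quoted in the article; actual counts depend on
# kernel version, architecture, and the seccomp profile in use.
TOTAL_SYSCALLS = 340          # rough number of syscalls in a Linux kernel
DOCKER_SECCOMP_BLOCKED = 44   # blocked by the default Docker seccomp profile
NABLA_SYSCALLS = 7            # syscalls used by Nabla, per the article


def exposed_syscalls(total: int, blocked: int) -> int:
    """Syscalls still reachable after a profile blocks `blocked` of them."""
    return total - blocked


def surface_ratio(exposed: int, total: int) -> float:
    """Fraction of the kernel's syscall interface that remains reachable."""
    return exposed / total


docker_default = exposed_syscalls(TOTAL_SYSCALLS, DOCKER_SECCOMP_BLOCKED)

# Hypothetical labels, used only to print a comparison.
surfaces = {
    "no seccomp profile": TOTAL_SYSCALLS,
    "Docker default seccomp": docker_default,
    "Nabla": NABLA_SYSCALLS,
}

for name, count in surfaces.items():
    ratio = surface_ratio(count, TOTAL_SYSCALLS)
    print(f"{name}: {count} syscalls ({ratio:.1%} of the kernel interface)")
```

The count alone is a coarse proxy: which syscalls remain, and what capabilities the process holds when it calls them, matter as much as how many there are.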

In that vein, one clear use case for this new marriage of hypervisor technology and containers is to add a layer of protection and isolation to potentially sensitive (or untrusted) workloads for which we clearly want the smallest possible attack surface on our host. While even considering wrapping a container in a fully-booting virtual machine might have been a disastrous thought performance-wise a mere few years ago, this is no longer the case. The work that Intel did with Clear Containers (as well as Hyper.sh and their now-shuttered offering) to drastically reduce boot time and clear some of the historic cobwebs of how we boot virtual computers means boot times for these lightweight virtualized containers have dropped to within the sub-second expectations to which containers made us accustomed. Firecracker, applying similar streamlining via its Rust-based VMM, claims to achieve 125ms boot times, even at significant microVM concurrency, with a claimed 150 VMs booted per second per host. Clearly, if one can achieve an increase in process isolation with nearly the same performance characteristics, lightweight virtualization becomes a viable option for workloads in which this added security layer is a desired property.

Assuming you have workloads for which this level of isolation is attractive, Kubernetes added a new RuntimeClass resource to its API (alpha as of Kubernetes 1.12, beta as of Kubernetes 1.14), which can be combined with the `runtimeClassName` selector in the Pod specification to point to a specific runtime as the "isolator of choice" to back a specific pod in Kubernetes. I demonstrated a fairly complete list of possible isolation choices using this feature at KubeCon EU in May 2019 including gVisor, Kata, Nabla, and Firecracker. While two of those are not lightweight hypervisors, all of them are providing an approach to answer the question "how can I get more isolation for my pods than Linux kernel namespaces and cgroups?" As public clouds provide more customization within their managed Kubernetes offerings, registering and installing specific RuntimeClass options within your cluster will likely become an expected capability. Today, this can be achieved in a brute force way by directly accessing worker nodes, installing required pieces like the Kata or Firecracker components, and then using the RuntimeClass features within the existing Kubernetes API to select specific runtimes.
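To make the pairing of a RuntimeClass with `runtimeClassName` concrete, here is a minimal sketch that builds both manifests as plain Python dictionaries and prints them as JSON. The handler name `kata`, the pod name, and the image are hypothetical placeholders: the handler must match whatever name the node's CRI runtime was actually configured with. The default `node.k8s.io/v1beta1` API version corresponds to the beta described above; later Kubernetes releases serve RuntimeClass under `node.k8s.io/v1`.

```python
import json


def runtime_class(name: str, handler: str,
                  api_version: str = "node.k8s.io/v1beta1") -> dict:
    """Build a RuntimeClass manifest.

    `handler` is the runtime name configured in the node's CRI runtime;
    the value used below is an assumption, not a fixed Kubernetes name.
    """
    return {
        "apiVersion": api_version,
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def sandboxed_pod(pod_name: str, image: str, runtime_class_name: str) -> dict:
    """Build a Pod manifest that selects a runtime via runtimeClassName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": "app", "image": image}],
        },
    }


# Hypothetical names throughout: "kata", "untrusted-job", and the image.
manifests = [
    runtime_class("kata", handler="kata"),
    sandboxed_pod("untrusted-job", "example.com/untrusted-app:latest", "kata"),
]

# Wrap both objects in a v1 List so they can be applied together.
bundle = {"apiVersion": "v1", "kind": "List", "items": manifests}
print(json.dumps(bundle, indent=2))
```

On a cluster whose nodes already have the matching runtime installed and registered, output like this could be piped to `kubectl apply -f -`. Pods that omit `runtimeClassName` keep using the node's default runtime.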

Public clouds, however, are unlikely to sit back and leave the usage of these additional isolation features to users to figure out how to configure and operate. Google Cloud has already announced that GKE Sandbox, Google Cloud Run, and Google Cloud Functions utilize gVisor today. When Google announced Cloud Run, the use of gVisor as an isolator received some significant interest, and various engineers from Google, including Ahmet Balkan, provided valuable details on how Google utilizes these capabilities within their platform.

User Experience

Security may not be the whole story for this new-found love of lightweight virtualization in the container ecosystem. While there are still detractors, container adoption data shows that a consistently growing number of companies, developers, and open source projects have become dependent on the simple life-cycle and user experience that Docker brought to the container world.

For the most part, the Dockerfile as a standard for defining our images, added to a growing mixture of available build tooling to assemble these into immutable and easily deployable units, has taken hold in our modern day DevOps culture. But what if users still have use cases for the capabilities provided via VMs within our brave new world — and specifically, with the ease and simplicity of the container image and runtime model? This is where Weave’s Ignite project seems to have captured the imagination of a hybrid model combining VMs and containers. With Ignite, you get a "Docker-like" user experience with a booted VM as the end result. Given your VM is still sourced — the user space file system to be specific — from a standard container image you can keep your CI/CD-driven build steps to produce a container image, but with Ignite, that container image easily becomes a VM, booted and managed with Firecracker’s Rust-based VMM, with the security and performance promises via that underlying project. Along the way Ignite marries your container image "user space" to a Linux kernel and init process — optionally one of your choosing — to create a fully bootable virtual machine. To quote the announcement blog post directly, with Ignite and the existing landscape of tooling you can imagine that "we can run a cloud of VMs ‘anywhere’ using Kubernetes for orchestration, Ignite for virtualization, GitOps for management, and supporting cloud native tools and APIs."

Given Ignite’s flexibility — for example, optionally providing your own kernel and/or kernel modules, packaged as a container image of course — this opens up opportunities for many interesting use cases, all accessible from modern container-native tooling: everything from traditional VM packaged workloads to edge computing to special test and ephemeral setups that can be spun up on-demand with none of the overhead and complexity of traditional VM image creation tools and processes. To be explicit, there are no vmdk files involved. There are no qcow or raw image tools required to work with Ignite. You are always working with container images, and Weave has already provided a default Ubuntu and CentOS image with some VM-friendly additions like sshd and systemd, for example. Networking is handled for you via CNI plugin, giving you the same networking capabilities that exist in any standard Kubernetes deployment. Ignite’s current implementation allows for port-forwarding to your VMs based on command-line flags or configuration file-defined parameters.
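As a rough illustration of that "Docker-like" workflow, the sketch below assembles an `ignite run` command line without executing it. The flags shown (`--name`, `--cpus`, `--memory`, `--ssh`, `--kernel-image`, `--ports`) and the `weaveworks/ignite-ubuntu` image follow the project's README as it stood around the announcement; treat them as assumptions and check `ignite run --help` for the version you have installed. The helper function itself is hypothetical.

```python
import shlex
from typing import List, Optional


def ignite_run_command(image: str, name: str, cpus: int = 2,
                       memory: str = "1GB", ssh: bool = True,
                       kernel_image: Optional[str] = None,
                       ports: Optional[List[str]] = None) -> List[str]:
    """Assemble (but do not execute) an `ignite run` argument list.

    Flag names are assumptions taken from the Ignite README of the time.
    """
    cmd = ["ignite", "run", image,
           "--name", name,
           "--cpus", str(cpus),
           "--memory", memory]
    if ssh:
        cmd.append("--ssh")            # enable SSH access to the VM
    if kernel_image:
        # Optional: supply your own kernel, itself packaged as an image.
        cmd += ["--kernel-image", kernel_image]
    for mapping in ports or []:
        # Assumed host:vm port mapping format, e.g. "8080:80".
        cmd += ["--ports", mapping]
    return cmd


cmd = ignite_run_command("weaveworks/ignite-ubuntu", "my-vm",
                         ports=["8080:80"])
print(shlex.join(cmd))
```

The input is an ordinary container image reference, so the same CI/CD pipeline that produces container images can feed VM creation with no separate disk-image tooling.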

Isolation Flexibility

We’ve seen that "defense in depth" security initiatives and a container-native user experience for VMs are both elements leading to an increased interest in lightweight virtualization. But for the Kubernetes world where core container runtime choice is fairly static — nodes in your cluster are probably using Docker, maybe containerd, and optionally cri-o, but not multiples at the same time — can and will we have a world where clusters can provide dynamic choice based on workload type? The answer is yes!

I’ve already covered the recent RuntimeClass feature, added within the past few Kubernetes releases, which provides for a pod-level selection of registered runtimes. This replaces a deprecated and less flexible "untrusted workload" feature which could only be on or off for any given pod. On a properly installed and configured cluster, you can imagine within the same Kubernetes environment the ability to select — according to any self-defined criteria — various levels of workload isolation preferred from standard Linux containers to gVisor to unikernel-backed Nabla to either of the popular virtualization runtimes: Kata and Firecracker.

The CNCF containerd project has enabled this flexibility to an even greater degree by formalizing a shim API, allowing for 3rd party shim implementations to plug into a standard containerd installation, allowing the shim to drive the actual container lifecycle operations directly with an isolator’s technology-of-choice framework. For example, this means a shim specific to Kata can handle Kata’s management of boot-time VM parameterization when containerd asks the shim to "start a container." The main containerd daemon is not required to know anything about these details and therefore can remain simple and unaware of the actual containerization primitives in use. Open source project shims exist for gVisor, Kata, and Firecracker today, as well as the included containerd project v2 shim for Linux and Microsoft’s new shim implementation which arrived in the containerd 1.3 release last month.
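One small, concrete piece of this plug-in model is how containerd locates a shim. A runtime is referred to by a dotted name, and containerd looks for a binary on the host whose name is derived from the last two components of that name. The sketch below mimics that naming convention in Python. It approximates the behavior rather than copying containerd's code, and the runtime names listed are the ones commonly documented by the respective projects, so verify them against the versions you deploy.

```python
def shim_binary_name(runtime_type: str) -> str:
    """Derive the shim binary name containerd would look for.

    Approximates the convention: the last two dot-separated components
    of the runtime name become `containerd-shim-<name>-<version>`.
    """
    parts = runtime_type.split(".")
    if len(parts) < 2 or not parts[0]:
        raise ValueError(f"invalid runtime name: {runtime_type!r}")
    return f"containerd-shim-{parts[-2]}-{parts[-1]}"


# Commonly documented runtime names (assumptions; check each project).
runtime_types = [
    "io.containerd.runc.v2",   # containerd's own v2 shim for Linux
    "io.containerd.kata.v2",   # Kata Containers
    "io.containerd.runsc.v1",  # gVisor
    "aws.firecracker",         # firecracker-containerd
]

for runtime_type in runtime_types:
    print(f"{runtime_type} -> {shim_binary_name(runtime_type)}")
```

Because the lookup is purely name-based, adding a new isolator to a node amounts to installing its shim binary and referencing the matching runtime name in the containerd configuration. The main daemon needs no knowledge of how that shim starts a VM or sandbox.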

While these features within Kubernetes and containerd are orthogonal to the actual capabilities of hypervisor-based isolators, providing a way for these isolation methods to be easily integrated at the container runtime and container orchestration layers will clearly be beneficial to broad adoption of lightweight virtualization as a technology. Marrying this ease of integration with the security and container user experience tenets already discussed provides three clear signposts pointing us to potential realization of the late 2018 and early 2019 predictions as "the year of the hypervisor." Production usage and cloud platform integration are also valuable markers for any technology, and both the fact that AWS clearly expects Firecracker to be the underpinnings of their popular Lambda service and the aforementioned gVisor usage across several Google Cloud properties seem to point in positive directions as well.

Summary

So far our treatment of this topic has heavily leaned towards the positive characteristics of this new containers-as-microVMs era. What isn’t there to like? Well, for operators/admins, developers looking for application debug tools, performance tuning experts, and associated roles, there are still characteristics of a running, booted virtual machine that reintroduce complications we had happily dismissed in our rush to the simplicity of a Linux process-as-container.

More concretely, with hypervisor-wrapped containers an additional Linux kernel is being provided by these new runtimes (or the user), and that kernel requires patching and maintenance and has all the usual exposures of running a Linux kernel. Additionally, we re-introduce a hypervisor boundary between our running application and the host-side networking and storage hardware. These are problems that have been worked on and improved for decades with special virtual IO drivers, PCI pass-through, and other techniques, and they are now relevant again in our lightweight virtualization-wrapped containers. To be clear, smart and capable minds are already working on these problems; case in point, see this presentation from Red Hat discussing the state of the virtio-fs project for Kata containers. We also have to be aware of the security threat model and potentially required hardening for our new hypervisor stack itself: adding isolation does not allow us to simply ignore the large codebase we have just inserted into our software stack.

That said, I think we can all agree that hypervisors are back, and they are married to containers! There’s a growing and significant level of excitement for how these pieces will play a role in specific use cases and we are already seeing public cloud platforms create and adopt some of these new isolation techniques. What will 2020 bring to this exciting space? We only have a few months left of 2019 before we find out, and for sure it doesn’t look like anyone will have to eat their words about the rise of the significance and use of the hypervisor in 2019.

About the Author

Phil Estes is a Distinguished Engineer & CTO, Container and Linux OS Architecture Strategy for the IBM Watson and Cloud Platform division. Phil is currently an OSS maintainer in the Docker (now Moby) engine project, the CNCF containerd project, and is a member of both the Open Container Initiative (OCI) Technical Oversight Board and the Moby Technical Steering Committee. Phil is a member of the CNCF Ambassadors program and has broad experience in open source and the container ecosystem. Phil speaks worldwide at industry and developer conferences as well as meetups on topics related to open source, Docker, and Linux container technology. Phil blogs regularly on these topics and can be found on Twitter as @estesp.