This post was the basis for a joint event with the grokking engineering community in Saigon.

The event was centered around DevOps; for our talk, Docker Saigon needed to interest an engineering audience in how things tick on the inside of Docker. Audience experience with Docker and Linux operating systems was expected.

Anyone interested in learning more about Docker: full, free hands-on-labs training events are scheduled for 23/24 March. For more details go to meetup.com/Docker-Saigon

Outline

Overview of Linux containers

The target of this section is to give a very short overview of containers from a Linux system perspective. It is not meant as an introduction for users unfamiliar with Docker or with Linux systems.

Developers in Saigon looking to find out how they can get started with Docker are referred to the excellent Installation Guide (OSX/Windows) and User Guides available on the Docker website.

Anyone wondering if/why Docker matters, is invited to contact the Docker Saigon user group (preferably through our Slack auto-invite app) for discussion.

What is a container?

In 4 bullet points:

Containers share the host kernel

Containers use the kernel's ability to group processes for resource control

Containers ensure isolation through namespaces

Containers feel like lightweight VMs (lower footprint, faster), but are not Virtual Machines!

Components of a container ecosystem include:

Runtime

Image distribution

Tooling

Now… if you look in the Linux kernel, there is no such thing as a container… so what gives?

History of Container Technology

Chroot circa 1982

FreeBSD Jails circa 2000

Solaris Zones circa 2004

Meiosys - MetaClusters with Checkpoint/Restore 2004-05

Linux OpenVZ circa 2005 (not in mainstream Linux)

AIX WPARs circa 2007

LXC circa 2008

Systemd-nspawn circa 2010-2013

Docker circa 2013: built on LXC, moved to libcontainer (March 2014); appC (CoreOS) announced (December 2014); Open Containers standard for convergence with Docker announced (June 2015); moved to runC (OCF compliant) (July 2015)

… many more container formats coming?

Reference slide deck

How do containers compare to Package Managers?

Why are containers different from package management?

Packaging into an image is similar to an RPM, but outside of the Linux distributions themselves, software is rarely packaged correctly.

The big innovation of Docker is that it is a slightly easier to use package manager. Package managers failed us because version differences between shared libraries cause dependency issues; packaging the shared libraries inside the image works around that.

What is missing?

Package managers provide an easy way to find out what is inside the packages. If you are wondering how to handle this with Container Images…

See the DockerCon EU talks where a system of metadata tags was suggested for image inspection. See Shipping Manifests, Bill of Lading and Docker Metadata and Containers - Video

How do containers compare to Configuration Management?

Configuration Management utilities provide the ability to store Infrastructure as code. Popular CM tools include:

Several of these tools still provision the environment procedurally, as opposed to distributing a self-contained package that runs in exactly the same way in every environment on the same architecture (environments may differ in Linux distribution [Ubuntu/Red Hat/…], scale [local laptop / server cluster / …], …).

However, it is still advisable to leverage such a provisioning tool to bootstrap the Docker infrastructure, letting the Container Runtime layer take care of the application layer once it is ready.

In Summary, I believe the following key points drive the adoption of Docker containers:

Docker provides a self-contained image that is exactly the same image running on your laptop as in the cloud, while e.g. Puppet/Chef are procedural scripts that need to be rerun to converge your cluster machines. This enables approaches also known as Immutable Infrastructure or Phoenix Deploys.

Docker is really fast: standing up a container takes just a few seconds. There is very little overhead (cpu, memory, io, image footprint, …), enabling high density (such as running a full stack of containers on your laptop; with Puppet/Chef you'd need to create several VMs with a much heavier footprint).

The community adopted Docker quickly due to how easy it is to build an image: the Dockerfile DSL is very simple yet very powerful (you can use pure bash to build the image, or load python scripts or anything similar you are familiar with for machine configuration).

Why Docker?

Docker is currently the only ecosystem providing the full package:

Image management

Resource Isolation

File System Isolation

Network Isolation

Change Management

Sharing

Process Management

Service Discovery (DNS since 1.10)

How?

The target of this section is to have a very detailed look at each component in the Linux stack that makes Linux Containers possible.

A higher level overview is available (and was used as a reference) in the official Docker documentation.

UPDATE: See Also jfrazelle’s talk @ container summit February 2016

Kernel Namespaces

Allow you to create isolation of:

Process trees (PID Namespace)

Mounts (MNT namespace) wc -l /proc/mounts

Network (Net namespace) ip addr

Users / UIDs (User Namespace)

Hostnames (UTS Namespace) hostname

Inter Process Communication (IPC Namespace) ipcs Notable example using IPC = PostgreSQL
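As a quick illustration, the namespaces of any process can be inspected through /proc, and new namespaces can be created by hand with unshare from util-linux (a minimal sketch; flags shown assume a reasonably recent util-linux and root privileges):

ls -l /proc/$$/ns                                    # namespace handles of the current shell
sudo unshare --uts sh -c 'hostname demo; hostname'   # new UTS namespace: the hostname change stays inside it
hostname                                             # the host's hostname is untouched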

Cgroups

Kernel control groups (cgroups) allow you to do accounting on resources used by processes, a little bit of access control on device nodes and other things such as freezing groups of processes.

Ref DockerCon EU: jpetazzoni: What are containers made from. We attempt to provide a summarized overview of this excellent presentation here.

cgroups consist of one hierarchy (tree) per resource (cpu, memory, …). For example:

cpu                         memory
├── batch                   ├── 109
│   ├── hadoop              ├── 88 <
│   │   ├── 88 <            ├── 25
│   │   └── 109             ├── 26
└── realtime                └── databases
    ├── nginx                   ├── 1008
    │   ├── 25                  └── 524
    │   └── 26
    ├── postgres
    │   ├── 524
    └── redis
        └── 1008

We can create sub groups in each hierarchy; in the example above, custom batch and realtime sub groups were created for the cpu resource. Each process is in exactly one node per resource (pid 88 is in a node for the memory resource as well as for the cpu resource, …).

Note: cgroups are system wide. The feature is enabled/disabled at boot time and can not be controlled on a per-process level.

A closer look at each resource tree:

Memory cgroup:

The Memory resource provides 3 types of functionality: Accounting, Limits & Notifications

Accounting

granularity = memory page size (typically 4kb, depending on the architecture)

2 types of memory pages:

file pages: loaded from disk (important because we know the data is still on disk and can be removed from memory; no need to swap when memory needs to be reclaimed)

anonymous pages: memory that does not correspond to anything on disk; for this type we have to swap out if we want to reclaim the memory

Some pages can be shared, for example: multiple processes reading from the same files.

Pages are kept in 2 pools:

Active pages

Inactive pages - frequently accessed pages are kept in the active set.

Each page is accounted to a group; shared pages are only accounted to one group and are re-allocated to another group if that group goes away.

Limits

Each group optionally has 2 types of limits:

Hard limits : If the group goes above its hard limit, the group gets killed with an out of memory error. (which is why it is a good practice to put a single process in a container)

Soft limits: not enforced… except when the system starts to run out of memory. The more a process goes over its soft limit, the higher the chance pages get reclaimed for its group

There are 3 kinds of memories on which limits can be applied:

physical memory

kernel memory: to avoid processes abusing the kernel to allocate memory

total memory
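These limits map onto docker run flags (a hedged sketch; values are arbitrary and flag names are those of the docker CLI of this era):

# physical memory hard limit, soft limit, total memory (memory + swap) and kernel memory limit
docker run -dit --name memdemo -m 256m --memory-reservation 128m --memory-swap 512m --kernel-memory 64m alpine sh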

Note

oom-notifier: provides a mechanism to give control to a user program when a group goes over its limits, by freezing the processes in the group and notifying user space. At this point the program handling the notification could kill the container, raise the limits or migrate the container.

Overhead: Each time the kernel gives or takes a page to or from a process, counters are updated.

HugeTLB cgroup

Accounting for usage of huge pages by process group; we ignore this for now.

CPU cgroup

keeps track of user/system CPU time

keeps track of usage per CPU

allows setting weights - not limits. Why no limits? On an idle host, a container with low shares will still be able to use 100% of the CPU.

CPUSet cgroup

Bind group to specific CPU

Useful for:

Real Time applications

NUMA systems with localized memory per CPU
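Both relative CPU weights and CPU pinning are exposed through docker run flags (a small sketch; values are arbitrary):

docker run -dit --name cpudemo --cpu-shares 512 --cpuset-cpus "0,1" alpine sh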

BlkIO cgroup

Measures & limits the amount of block IO by group; unless your processes do direct IO, setting limits may give surprising results.

net_cls and net_prio cgroup

The kernel will only tag the traffic; you are responsible for doing traffic control ( tc )

Devices cgroup

Controls which group gets read/write access to devices. Can be used to prevent groups from reading/writing directly to disk drives - very important for containers.

Typically with containers access to /dev/{tty,zero,random,null} are allowed and everything else is denied.

Why /dev/random ? Because if you are generating encryption keys inside a container, you will quickly deplete the entropy unless you read it from the host..

Other interesting devices for containers:

/dev/net/tun if you want to do anything with VPNs inside a container without polluting the host

/dev/fuse for custom filesystems in a container

/dev/kvm to allow virtual machines to run inside a container

/dev/dri & /dev/video for GPU access in containers - (see NVIDIA/nvidia-docker).
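Docker exposes this cgroup through the --device flag on docker run; for instance, for the VPN case above (a sketch; the extra NET_ADMIN capability is typically also needed for tun/tap setup, and the container name is illustrative):

docker run -dit --name vpndemo --device /dev/net/tun --cap-add NET_ADMIN alpine sh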

Freezer cgroup

Freeze a whole group without sending SIGSTOP/SIGCONT to the processes (i.e. without interfering with them).

Useful for:

cluster batch scheduling

process migration - think CRIU

debugging without affecting ptrace
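Docker's pause/unpause commands are built on this cgroup (a quick sketch; "mycontainer" stands for any running container, and the path shown assumes cgroup v1 with Docker's default cgroup parent):

docker pause mycontainer
cat /sys/fs/cgroup/freezer/docker/$(docker inspect -f '{{.Id}}' mycontainer)/freezer.state   # FROZEN
docker unpause mycontainer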

How to manage cgroups with Systemd?

By setting the ControlGroupAttribute in the unit file:

.include /usr/lib/systemd/system/httpd.service
[Service]
ControlGroupAttribute=memory.swappiness 70

Or temporarily on a running process through:

systemctl set-property <group> CPUShares=512

To show all properties of an existing group:

systemctl show <group>

The above commands bypass the Docker daemon and may result in unexpected behaviour (i.e. settings are reverted on container restarts).

Note: Docker 1.10 introduced the docker update command to change cgroup limits on the fly for certain attributes:

Usage: docker update [OPTIONS] CONTAINER [CONTAINER...]

Updates container resource limits

  --blkio-weight=0           Block IO (relative weight), between 10 and 1000
  --cpu-shares=0             CPU shares (relative weight)
  --cpu-period=0             Limit the CPU CFS (Completely Fair Scheduler) period
  --cpu-quota=0              Limit the CPU CFS (Completely Fair Scheduler) quota
  --cpuset-cpus=""           CPUs in which to allow execution (0-3, 0,1)
  --cpuset-mems=""           Memory nodes (MEMs) in which to allow execution (0-3, 0,1)
  -m, --memory=""            Memory limit
  --memory-reservation=""    Memory soft limit
  --memory-swap=""           Total memory (memory + swap), '-1' to disable swap
  --kernel-memory=""         Kernel memory limit: container must be stopped
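For example, raising the CPU weight of a running container (the container name is illustrative):

docker update --cpu-shares 512 mycontainer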

How does the kernel expose cgroups?

Groups are created through a pseudo file system; this is how systemctl applies your configuration changes:

mkdir /sys/fs/cgroup/memory/somegroup/subcgroup

To move a process, just echo the process id to the special tasks file in the path of the group:

echo $PID > /sys/fs/cgroup/.../tasks
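Putting both together, a minimal sketch of creating a group with a memory limit by hand (cgroup v1 paths, run as root; the group name "demo" is arbitrary):

mkdir /sys/fs/cgroup/memory/demo
echo $((100*1024*1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes   # 100MB hard limit
echo $$ > /sys/fs/cgroup/memory/demo/tasks                                   # move the current shell into the group
cat /sys/fs/cgroup/memory/demo/memory.usage_in_bytes                         # accounting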

IPTables (networking)

Isolation on the networking level is achieved through the creation of virtual switches in the Linux kernel. The Linux bridge is a kernel module, first introduced in the 2.2 kernel (circa 2000), and it is administered using the brctl command on Linux.

Linux bridges are heavily used for the setup of Linux virtualization & Software Defined Networking (SDN).

Network shaping and bandwidth control for Linux containers can be achieved through the use of existing technology such as tc , I will not attempt to cover this here.

Below is a quick demo on how Docker uses the Linux Bridge together with IPTables functionality to create isolated Container networks and expose container ports.

Container networking and port forwarding

We will be using an Alpine image with DNS tools such as dig and an exposed port:

docker build -t so0k/envtest - << EOF
FROM alpine:latest
MAINTAINER Vincent De Smet <vincent.drl@gmail.com>
RUN apk --update add bind-tools && rm -rf /var/cache/apk/*
EXPOSE 80
EOF

Create a test network

docker network create test

Run 2 containers to demonstrate the resulting Linux configuration:

docker run --net test -dit --name host1 -P so0k/envtest sh

docker run --net test -dit --name host2 -P so0k/envtest sh

docker ps

Overview of Linux bridges & IPtable rules:

brctl show

sudo iptables -nvL

Notice a port has been opened for each port exposed within the container image:

ss -an | grep LISTEN

With the default Docker configuration, a userland docker-proxy process is used:

ps -Af | grep proxy

be careful if you need to open a lot of ports…

docker run --net test -dit --name prangetest -p 76-85:76-85 so0k/envtest sh

Memory usage by these proxies:

ps -o pid,%cpu,%mem,sz,vsz,cmd -A --sort -%mem | grep proxy

You can disable the userland docker-proxies, forcing Docker to use the Linux kernel 'hairpin' forwarding mode (kernel >= 3.6) with alternative iptables rules. This will improve network performance and memory usage. Note: if you do not use the docker-proxy, your other containers may not be able to connect without a hairpin NAT setup…
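A sketch of how this might look on an Ubuntu 14.04 host (the daemon options file and init system vary per distro):

# /etc/default/docker
DOCKER_OPTS="--userland-proxy=false"
# then restart the daemon and verify no proxy processes remain
sudo service docker restart
ps -Af | grep [d]ocker-proxy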

Next, demonstrate some simple “Service Discovery” provided within Docker networks:

docker exec -it host1 ping host2

docker exec -it host2 netstat -an

docker exec -it host1 dig host3 +noall +answer +stats

Notice how the container has been re-configured by Docker for name resolution:

docker exec host2 cat /etc/resolv.conf

The dns process was injected into the container:

docker exec -it host2 netstat -an

More info is available on configuration of the embedded DNS. Notice we can create container aliases and still create private links between containers where required.
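For example, a network-scoped alias can be given at run time (a small sketch using the image built above; the alias name "db" is arbitrary):

docker run --net test --net-alias db -dit --name host-db so0k/envtest sh
docker exec -it host1 ping db    # resolved through the embedded DNS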

Let’s demonstrate the isolation between separate container networks:

docker network create test2

docker run --net test2 -dit --name host3 -P so0k/envtest sh

docker run --net test2 -dit --name host4 -P so0k/envtest sh

Notice another Linux bridge was created for this network:

brctl show

sudo iptables -nvL

Confirm containers on the first network can not reach containers on the second network. (to really confirm this use the actual container IPs instead of hostnames)

docker exec -it host1 ping host4

Name Resolution was introduced with Docker 1.10 in Q1 2016. The Docker DNS server is not exposed to containers connected to the default Docker bridge for backwards compatibility. (Running containers without the --net parameter puts them on the default bridge):

docker run -dit --name def-host1 -P so0k/envtest sh

docker run -dit --name def-host2 -P so0k/envtest sh

No name resolution:

docker exec -it def-host1 cat /etc/resolv.conf

docker exec -it def-host1 hostname

docker exec -it def-host1 cat /etc/hosts

If these containers need to find each other, use links, just like it used to be before Docker 1.10

docker run -dit --name def-host3 --link def-host1 -P so0k/envtest sh

docker exec -it def-host3 cat /etc/hosts

If you want to expose additional ports to the public, here is an example for the containers connected to the Default bridge:

# forward packets from port 8001 on your host to port 8000 on the container
iptables -t nat -A DOCKER -p tcp --dport 8001 -j DNAT --to-destination ${CONTAINER_IP}:8000

Let's review the cgroup setup of all the containers created above, as seen earlier:

sudo systemd-cgls

Security

Currently no examples are provided in this document… This is a subject for further study.

Types of Containers

Given the above constructs, containers may be divided into 3 types as follows:

System Containers share rootfs, PID, network, IPC and UTS with host system but live inside a cgroup.

Application Containers live inside a cgroup and use namespaces (PID, network, IPC, chroot) for isolation from host system

Pods use namespaces for isolation from host system but create sub groups which share PID, network, IPC and UTS except the rootfs.

Note: current Pod implementations on top of Docker are suboptimal, as a workaround is needed to allow the sub groups to share namespaces (this is implemented through a sleep container which is essentially PID 1). Ideally something like systemd is used as PID 1 to share the namespaces between the sub groups, with chroot to separate the rootfs. Reference Brandon Philips: Where We Are and Where We Are Going
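A rough sketch of that workaround using only standard docker run flags (container names are illustrative; PID namespace sharing is left out as it was not exposed by docker run in this era):

docker run -dit --name pod-infra so0k/envtest sh                  # the "sleep"/infra container holding the namespaces
docker run -dit --name pod-app1 --net container:pod-infra --ipc container:pod-infra so0k/envtest sh
docker run -dit --name pod-app2 --net container:pod-infra --ipc container:pod-infra so0k/envtest sh
# pod-app1 and pod-app2 share network & IPC with pod-infra, but keep their own rootfs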

Images & Layers

Images you create yourself or images created by others are stored in Docker Registries. These are public or private stores from which you upload or download images. Docker registries are the distribution component of Docker.

There are 3 choices for use of a Registry:

A Public Cloud-hosted registry. The Docker Hub is the default registry used by the docker client and source of Officially maintained Docker images, however alternatives exists such as Quay.io. Limited Private repositories may be created or purchased to enable a quick Docker adoption.

An On-premise registry, through the commercially offered Trusted Docker Registry, providing advanced configuration options, Logging, usage and system health metrics and much more…

A Self-hosted registry based on the official open source Docker Registry. This is a fully functional registry which you can set up entirely by yourself and is the basis on which the Docker Trusted Registry is built, but it does not provide advanced monitoring & access control and requires manual maintenance.

Each Docker image references a list of read-only layers that represent filesystem differences. Layers are stacked on top of each other to form a base for a container’s rootfs.

When the container starts, the Docker engine prepares the rootfs & uses chroot for the container filesystem isolation - similar to LXC. One big innovation of the Docker engine was the concept of leveraging Copy-On-Write file systems to significantly speed up the preparation of the rootfs.

Copy-On-Write

Before Docker, LXC would create a full copy of the filesystem when creating a container. This was slow and took up a lot of space.

When Docker creates a container, it adds a new, thin, writable layer on top of the underlying stack of image layers. This layer is often called the “container layer”.

All changes made to the running container - such as writing new files, modifying existing files, and deleting files - are written to this thin writable container layer.

By not copying the full rootfs, Docker reduces the amount of space consumed by containers and also reduces the time required to start a container. Below is a diagram showing multiple containers, each with its own "container layer", sharing the same underlying image layers.
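The contents of this thin writable layer can be observed with standard docker commands (a quick sketch; the container name is arbitrary):

docker run -dit --name cow-demo alpine sh
docker exec cow-demo touch /newfile
docker diff cow-demo     # only the changes written to the container layer show up
docker history alpine    # the read-only image layers underneath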

Union File Systems provide the following features for storage:

Layering

Copy-On-Write

Caching

Diffing

By introducing storage plugins in Docker, many options are available for the Copy-On-Write functionality, for example:

OverlayFS (CoreOS)

AUFS (Ubuntu)

device mapper (RHEL)

btrfs (next-gen RHEL)

ZFS (next-gen Ubuntu releases)

A quick overview of when to choose which is provided here; full details are in the excellent Docker Docs.
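You can check which driver your daemon is using, and select another one with a daemon flag (a sketch; in practice the flag is set through your init system's daemon options rather than run by hand):

docker info | grep -i 'storage driver'
# the driver is selected with a daemon flag, e.g.:
docker daemon --storage-driver=overlay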

AUFS: PaaS-type work:

Pro: stable, production ready, good memory use, smooth Docker experience
Con: high write activity, not in mainline kernel

AUFS3 is currently the default & recommended driver on Ubuntu.

devicemapper (direct-lvm): Paas-type work:

Pro: stable, production ready, in mainline kernel, smooth Docker experience
Con: ??

The most stable configuration for production environments on RHEL, but it requires daemon flags to override the defaults.

devicemapper (loop): Lab testing - this is default in Docker on RHEL, not recommended for production

Pro: stable, in mainline kernel, smooth Docker experience
Con: not recommended for production, performance

It uses a loopback-mounted sparse file; the additional codepaths and overhead do not suit I/O heavy workloads.

OverlayFS: Lab testing

Pro: stable, in mainline kernel, good memory use
Con: container churn

Hailed as the future, default on CoreOS, but less mature and thus potentially less stable…

It also suffers from inode exhaustion problems when there is a high rate of container creation/removal, so it is not good for build pools.

Btrfs: Build Pools

Pro: in mainline kernel, high write activity, container churn

Overview of Container Runtimes

The target of this section is to play with other container runtimes (some of the past, some alternatives to Docker and some upcoming implementations)

LXC

Originally used by Docker as backend until libcontainer replaced it.

Installing:

install bridge-utils libvirt lxc lxc-templates

Available commands

lxc-attach lxc-config lxc-freeze lxc-start lxc-usernsexec lxc-autostart lxc-console lxc-info lxc-stop lxc-wait lxc-cgroup lxc-create lxc-ls lxc-top lxc-checkconfig lxc-destroy lxc-monitor lxc-unfreeze lxc-clone lxc-execute lxc-snapshot lxc-unshare

Quick Guide to use an LXC based container of busybox

wget https://www.busybox.net/downloads/binaries/busybox-x86_64 -O busybox
chmod a+x busybox
PATH=$(pwd):$PATH lxc-create -t busybox -n mycontainer
lxc-start -d -n mycontainer
lxc-console -n mycontainer   # (use CTRL-A Q to exit console mode)
lxc-stop -n mycontainer
lxc-destroy -n mycontainer

Interesting Read: Linux Containers without Docker using OverlayFS & Ansible.

The LXC project has been working on a more user-friendly daemon similar to the Docker daemon, called LXD, since November 2014.

Systemd-nspawn

Originally created to debug the systemd init system; future versions are to be more integrated into the core of the OS (the most low-level and minimal approach to making containers native to the OS).

CoreOS Toolbox uses systemd-nspawn and CoreOS rkt builds on top of it.

Installing:

Included with all recent Linux distribution releases..

Commands available

systemd-analyze systemd-delta systemd-nspawn systemd-ask-password systemd-detect-virt systemd-run systemd-cat systemd-cgls systemd-loginctl systemd-sysv-convert systemd-cgtop systemd-machine-id-setup systemd-coredumpctl systemd-notify systemd-tty-ask-password-agent systemd-inhibit systemd-stdio-bridge systemd-tmpfiles systemctl machinectl hostnamectl journalctl

Quick Guide to a container deployment using systemd-nspawn

# Create an image (fedora)
sudo yum -y --releasever=7 --nogpg --installroot=/mycontainers/centos7 \
    --disablerepo='*' --enablerepo=fedora \
    install systemd passwd yum fedora-release vim-minimal

# Change the root password in the image (through a shell in the rootfs)
sudo systemd-nspawn -D /mycontainers/centos7
passwd
exit

# Start the container as if booting into the container image
sudo systemd-nspawn -bD /mycontainers/centos7 -M mycontainer --bind /from/host:/in/container

# Get list of containers registered with machined
machinectl list
machinectl status mycontainer

# log into the container
machinectl login mycontainer

# or enter the running namespace
nsenter -m -u -i -n -p -t $PID

see also - Docker without Docker see also - Rubber Docker Workshop - Prep Slides

runC

Spun out of Docker Engine via libcontainer and made OCI compliant; currently the core of Docker Engine.

Installing runC

apt-get update && apt-get install libseccomp2
curl -Lo /usr/local/bin/runc https://github.com/opencontainers/runc/releases/download/v0.0.8/runc-amd64
chmod +x /usr/local/bin/runc

Building & Installing

On Digital Ocean Ubuntu 14.04 with Docker 1.10 image:

Build dependencies:

apt-add-repository -y ppa:evarlast/golang1.4
apt-get update
apt-get install make gcc g++ libc6-dev libseccomp-dev golang

Procedure

cd ~
git clone https://github.com/opencontainers/runc
cd runc
export GOPATH="$(pwd)"
export PATH="$PATH:$GOPATH/bin"
make
make install
cd ~

Commands available

checkpoint pause delete restore events resume exec spec kill start list help

Quick guide to container deployment using runC, with Docker for image shipping. Keep in mind that the Docker Engine does all of the below behind the scenes for us; appreciate the level of comfort it provides.

# Download an OCF compliant image (using docker for example)
docker pull busybox
# Create busybox/rootfs
mkdir -p busybox/rootfs
# Flatten the image layers & copy to rootfs
tmpcontainer=$(docker create busybox)
docker export $tmpcontainer | tar -C busybox/rootfs -xf -
docker rm $tmpcontainer
# Generate container spec file
cd busybox/
runc spec
# start the container
runc start test
# confirm we are now in busybox container
/bin/busybox ps -a

Alternatively, download image layers from a registry using tianon's script download-frozen-image-v2.sh

Or with debootstrap …

cd ~
apt-get install debootstrap
mkdir -p debian_wheezy/rootfs
debootstrap --arch=amd64 wheezy debian_wheezy/rootfs
cd debian_wheezy
runc spec
runc start debian

You can use post-start hooks (in config.json) to call additional binaries/scripts to do things such as set up the virtual bridge, veth pair and iptables rules for your container.

Docker API

The target of this section is to give an overview of how we might hook in to the various Docker components to leverage some of its notification systems. This is purely to quench the thirst of engineers looking to understand platforms built on top of Docker.

Many existing platforms already provide orchestration layers and it is advisable to research existing solutions before implementing your own using these events.

Docker Engine

Events

Use Cases:
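For example, the engine's event stream can be consumed straight from the CLI and reacted on (a minimal sketch; the filters shown are standard docker events options):

docker events --filter 'event=start' --filter 'event=die'   # stream container lifecycle events
# a consumer could react to these, e.g. to update a load balancer, send notifications or collect an audit trail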

Docker Registry

Notifications through webhooks:

Use Case: conduit

Conduit exposes an endpoint that receives webhooks (i.e. from Docker Hub). Upon receiving the hook, Conduit will pull the new image, deploy a new container from the updated image and then remove the original container.

Docker Compose

Via stdout

See: Docker Compose events docs & PR

Sample gist (from PR):

#!/bin/bash
set -e

function handle_event() {
  local entry="$1"
  local action=$(echo $entry | jq -r '.action')
  local service=$(echo $entry | jq -r '.service')
  local hook="./hooks/$service/$action"

  if [ -x "$hook" ]; then
    "$hook" "$entry"
  fi
}

docker-compose events --json | (
  while read line; do
    handle_event "$line"
  done
)

Container Format explosion

As Docker made containers easy, an ecosystem emerged with an incredible amount of contributions towards the Docker standard.

However, different opinions exist concerning the exact requirements & responsibilities of each layer within a container infrastructure, and with many big players looking to take a piece of the pie, divergence was to be expected.

The target of this section is to have a look at future and upcoming infrastructures. Out of these, Docker is currently (end 2015) the most mature and the easiest for beginning users to get started with.

Containerd (Alpha) - By Docker

See containerd.tools - Spinning out the Docker Daemon into a more advanced and OCI compliant Daemon to control runC.

Uses GRPC

A high performance, open source, general RPC framework that puts mobile and HTTP/2 first.

Containerd is the plumbing component that will manage containers in a future version of Docker Engine.

curl -Lo /usr/local/bin/containerd https://github.com/docker/containerd/releases/download/0.0.5/containerd
curl -Lo /usr/local/bin/ctr https://github.com/docker/containerd/releases/download/0.0.5/ctr
curl -Lo /usr/local/bin/containerd-shim https://github.com/docker/containerd/releases/download/0.0.5/containerd-shim
chmod +x /usr/local/bin/{containerd,ctr,containerd-shim}
nohup containerd >/dev/null 2>&1 &

Create a redis rootfs, using Docker to pull the image from the Hub:

mkdir -p redis/rootfs
docker pull redis
tmpredis=$(docker create redis)
docker export $tmpredis | tar -C redis/rootfs -xf -
docker rm $tmpredis

Prepare the OCI bundle:

generate config.json

runc spec

edit config.json :

terminal: false

populate uid & gid

set args: "redis-server", "--bind", "0.0.0.0"

set correct cwd

edit runtime.json :

remove network namespace for now to allow easy connections from localhost for testing…

see config.json & runtime.json from containerd repository

Or generate bundles from Docker container definitions with jfrazelle/riddler

OCI (OpenContainers Initiative)

OCI currently only covers the Runtime

Doesn’t cover how an image is defined, may cover Identity confirmation

Docker provided tech draft and implementation of OCI in runC (moving libcontainer to runC in the process).

OCI? (simple tarballs of the layers+metadata being pushed)

OCI and link with AppC?

The individuals behind the appc effort are joining the technical leadership of the OCI, and our intention is to work towards both a common format that is compatible with existing container formats as well as to work on a future spec that combines the best elements of all the existing container efforts.

See also CoreOS announcement & Docker announcement

Creating and maintaining formal specifications (“OCI Specifications”) for container image formats and runtime, which will allow a compliant container to be portable across all major, compliant operating systems and platforms without artificial technical barriers.

The idea behind OCI was to take the widely deployed runtime and image format implementation from docker and build an open standard in the spirit of appc.

AppC - By CoreOS

Ref (June 2015) Ref (Nov 2015)

Image format (ACI) and Identity, initially based on Docker image format

Container Signing

Discovery mechanism allowing to easily store images and find where the images are (no default registry, no special registry)

Runtime environment: defined behavior on running the images.

Tooling: no fancy tooling required. For example, building is easy with the command line tools tar and gzip, and gpg to sign the images
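As a sketch of how minimal that tooling can be (file names are illustrative; an ACI is an optionally gzipped tarball containing a manifest and a rootfs directory):

mkdir -p myapp/rootfs
cp -a ./build/* myapp/rootfs/           # whatever files the app needs
cp manifest myapp/manifest              # JSON image manifest per the AppC spec
tar -C myapp -czf myapp.aci manifest rootfs
gpg --armor --detach-sign myapp.aci     # detached signature used for verification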

Image Format

ACI (ref AMI) needs to contain all files and metadata needed to execute a given app.

Notable difference with Docker: ACIs need to specify the mount points…

Docker doesn't require you to specify the volumes, which gives flexibility, but you can't read the image manifest and know all the required mountpoints. AppC can force volumes to be defined at run time and fail if they have been omitted.

rootfs: same as in the Docker image format. Could be an existing system, tarred up. Could be generated with docker build. Could be built with native Debian/Red Hat system tools that build full systems in a chroot.

image manifest: all fields are defined on the AppC repo. Key points are the concept of labels, which could be used to define kernel requirements (containers share the kernel), and the explicit requirement on mountpoint definitions.

Images should be content addressable and share layers.

Discovery Mechanism

Translates an ACI name into a downloadable image. All ACIs must have a detached signature and go through a verification process.

Could be by convention using a template on the runtime.

Could be by probing a metadata endpoint to retrieve the discovery mechanism (if you want to use a different protocol, for example bittorrent to distribute your images)

Runtime Environment

AppC defines how ACIs are executed on a host. Fundamental concept is to allow multiple images to be running inside a container and define recovery policy for each image instance within the container.

Defines:

FileSystem layout: uses the concept of Pod = ability to compose a collection of containers into a single execution unit.

Volumes: there is a specific requirement to specify all mountPoints, and it is the executor's task to provide them

Networking (CNI): network plugins

Resource Isolators: all cgroups should be defined when executing a container

Logging: Runtime is responsible for having logs for all the Pods and the containers running in them

Tooling

Providing actool which allows you to actool build , actool cat-manifest , actool validate …

You can build with actool or the commandline tool listed above

Runtime may be able to convert Docker images on the fly, or you could use tools such as docker2aci to convert Docker images, deb2aci to convert packages … for you.

Image content verification: the initial naive implementation is to use detached gpg signatures (basically you define which publicly signed hash you expect when downloading things over the internet), which is not ideal.

Upcoming standard for image verification is The Update Framework (TUF), which is adopted by Docker through Notary. TUF is similar to yum index / apt repo. Essentially a JSON file providing metadata of all images in a registry together with cryptographic metadata for verification once downloaded.

Existing Implementations of AppC

rkt

works in 3 stages:

stage0: get the image, unpack, verify, ..

stage1: runs the image (with nspawn) - currently launches the systemd init system; processes run directly in the process tree under the assigned cgroup (not via a daemon).

stage2: applying the isolators

Comparison vs rkt & Docker: