Namespace isolation is the simplest virtualization technology available in the Linux kernel. It gives a process and all its descendants their own private view of globally shared kernel resources such as the network stack, the process table and the mount table. The feature has been popularized mainly by tools such as LXC (Linux Containers), Docker and virtenv.

Three system calls manage Linux namespaces: unshare() and clone() create them, and setns() joins an existing one. In this article I will take a look at unshare() and show how to use it directly in your scripts and programs without going through LXC or any other higher-level virtualization tool.

I’ll start by investigating the unshare command available in the util-linux package, and from there I’ll move to the system call. In the end I’ll build a small C program that isolates a web browser such as Mozilla Firefox in a kernel namespace.

/bin/bash

The unshare command is basically a wrapper for the system call. I run it as root and instruct it to create a new mount namespace in a standard bash session. The new namespace starts with a copy of the original mount table, so the first step is to mark every mount as a slave of the original: mount events from the parent namespace still propagate in, but mounts made inside the new namespace stay private.

$ su
Password:
# cd ~
# unshare --mount /bin/bash
# mount --make-rslave /
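One way to confirm that a process really sits in a new mount namespace is to compare the target of the /proc/self/ns/mnt link before and after the unshare. The sketch below only reads that link. It assumes /proc is mounted and that the kernel exposes /proc/self/ns (the mnt entry appeared in Linux 3.8); ns_id is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: copy the identifier of one of this process's
 * namespaces (for example "mnt", "net" or "ipc") into buf.
 * Returns 0 on success, -1 if the link cannot be read or the result
 * may have been truncated. */
static int ns_id(const char *name, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    assert(name && buf && len > 1);
    if (snprintf(path, sizeof(path), "/proc/self/ns/%s", name) >= (int)sizeof(path))
        return -1;

    /* The entry is a symbolic link whose target looks like "mnt:[4026531840]". */
    n = readlink(path, buf, len - 1);
    if (n < 0 || (size_t)n >= len - 1)
        return -1;
    buf[n] = '\0';   /* readlink() does not NUL-terminate the result */
    return 0;
}
```

Calling ns_id("mnt", ...) once before and once after unshare and comparing the two strings shows whether the namespace changed. Two processes that print the same identifier share the same mount namespace.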

From this point on, I can go and harden my filesystem. I’ll start by making /bin, /sbin, /lib, /lib64, /usr and /etc read-only.

# mount --bind /bin /bin
# mount --bind -o remount,ro /bin
# mount --bind /sbin /sbin
# mount --bind -o remount,ro /sbin
# mount --bind /lib /lib
# mount --bind -o remount,ro /lib
# mount --bind /lib64 /lib64
# mount --bind -o remount,ro /lib64
# mount --bind /usr /usr
# mount --bind -o remount,ro /usr
# mount --bind /etc /etc
# mount --bind -o remount,ro /etc

A small read/write test is in order:

# ls > /bin/ttt
bash: /bin/ttt: Read-only file system
#
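The same check can be made from a program without attempting a write. The sketch below asks statvfs() for the mount flags and looks at ST_RDONLY; is_readonly is a hypothetical helper name, and I am assuming a reasonably recent kernel that reports per-mount flags for bind mounts through this interface.

```c
#include <assert.h>
#include <sys/statvfs.h>

/* Hypothetical helper: returns 1 if the filesystem holding path is mounted
 * read-only, 0 if it is writable, and -1 if statvfs() fails (errno is set). */
static int is_readonly(const char *path)
{
    struct statvfs st;

    assert(path);
    if (statvfs(path, &st) < 0)
        return -1;
    return (st.f_flag & ST_RDONLY) ? 1 : 0;
}
```

Inside the hardened namespace, is_readonly("/bin") would be expected to return 1, while the same call from the original namespace would normally return 0.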

I also replace my home directory with an empty one based on tmpfs. The contents of this directory will be lost once the bash session is ended.

# mount -t tmpfs -o size=100m tmpfs /home/netblue

Alternatively, you could consider creating a brand new / directory using debootstrap and chrooting into it, or you can use any other method for building chroot jails.
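For the chroot alternative, the C side is short. The sketch below assumes a root tree has already been built somewhere (for example with debootstrap). The function name enter_jail is made up for illustration, and the call needs root privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: make new_root the root directory of the calling
 * process. Returns 0 on success, -1 on failure with errno set.
 * chroot() requires root (CAP_SYS_CHROOT). */
static int enter_jail(const char *new_root)
{
    assert(new_root);
    /* Move into the new tree first, so the process does not keep a
     * working directory outside the jail. */
    if (chdir(new_root) < 0)
        return -1;
    if (chroot(".") < 0)
        return -1;
    return 0;
}
```

Note that chroot() on its own only changes path resolution. It is usually combined with dropping root privileges afterwards, since a root process can escape a plain chroot.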

With the filesystem all set, start your daemon programs, or su into your regular user and start GUI programs – for example Firefox:

# su netblue
$ cd ~
$ firefox &

When you are finished, close all running programs and type exit enough times to close the bash session.

C programming

Most commands in the util-linux package are basically wrappers for various Linux system calls. This makes it easy to sketch the skeleton of your application in bash and translate it into C later. A small C application equivalent to all the scripting above looks like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <assert.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

/* Bind-mount dir onto itself, then remount the bind read-only. */
static void mnt_rdonly(const char *dir) {
    assert(dir);
    // mount --bind /bin /bin
    if (mount(dir, dir, NULL, MS_BIND|MS_REC, NULL) < 0)
        errExit(dir);
    // mount --bind -o remount,ro /bin
    if (mount(NULL, dir, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY|MS_REC, NULL) < 0)
        errExit(dir);
}

/* Cover dir with an empty tmpfs limited to 100 MB. */
static void mnt_tmpfs(const char *dir) {
    assert(dir);
    // mount -t tmpfs -o size=100m tmpfs /home/netblue
    if (mount("tmpfs", dir, "tmpfs", 0, "size=100m") < 0)
        errExit(dir);
}

int main(void) {
    // must run as root; the commented-out flags would isolate more resources
    if (unshare(CLONE_NEWNS /*| CLONE_NEWNET | CLONE_NEWIPC */) < 0)
        errExit("unshare");

    // mount --make-rslave /
    if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0)
        errExit("mount slave");

    mnt_rdonly("/bin");
    mnt_rdonly("/sbin");
    mnt_rdonly("/lib");
    mnt_rdonly("/lib64");
    mnt_rdonly("/usr");
    mnt_rdonly("/etc");
    mnt_tmpfs("/home/netblue");

    if (chdir("/") < 0)
        errExit("chdir");
    execlp("/bin/bash", "/bin/bash", (char *)NULL);
    errExit("execlp");   // reached only if the exec fails
}

The program starts with a call to unshare(), where CLONE_NEWNS requests a new mount namespace.

if (unshare(CLONE_NEWNS) < 0) errExit("unshare");
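The most common failure here is a missing privilege, so it can help to report it explicitly. The sketch below wraps unshare() and returns a negative errno value instead of exiting. The name try_unshare is hypothetical, and the flags are the ones already mentioned in the listing (CLONE_NEWNS, plus the commented-out CLONE_NEWNET and CLONE_NEWIPC).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: detach from the namespaces named in flags.
 * Returns 0 on success or a negative errno value on failure. */
static int try_unshare(int flags)
{
    if (unshare(flags) == 0)
        return 0;

    int err = errno;   /* save errno before fprintf() can change it */
    if (err == EPERM)
        fprintf(stderr, "unshare: permission denied (root or CAP_SYS_ADMIN is required)\n");
    else
        fprintf(stderr, "unshare: %s\n", strerror(err));
    return -err;
}
```

A call such as try_unshare(CLONE_NEWNS | CLONE_NEWNET) would request a new network namespace as well. Keep in mind that a fresh network namespace contains only a loopback interface, so a browser started inside it has no connectivity until an interface is configured.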

The root mount is then marked as a slave, followed by the read-only and tmpfs mounts discussed above:

// mount --make-rslave /
if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0)
    errExit("mount slave");

mnt_rdonly("/bin");
mnt_rdonly("/sbin");
mnt_rdonly("/lib");
mnt_rdonly("/lib64");
mnt_rdonly("/usr");
mnt_rdonly("/etc");
mnt_tmpfs("/home/netblue");
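Because mnt_rdonly() exits on the first error, the program stops on a system where one of these directories is missing. A possible refinement, sketched below, is to keep the list in a table and skip entries that do not exist. The names ro_dirs, dir_exists and for_each_existing_dir are hypothetical; in the program above the callback would be mnt_rdonly.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* The directories made read-only in the article. */
static const char *const ro_dirs[] = {
    "/bin", "/sbin", "/lib", "/lib64", "/usr", "/etc"
};

/* Returns 1 if path exists and is a directory (symbolic links are
 * followed), 0 otherwise. */
static int dir_exists(const char *path)
{
    struct stat st;

    assert(path);
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Hypothetical helper: call fn on every listed directory that exists and
 * return how many were found. Passing NULL for fn gives a dry run. */
static size_t for_each_existing_dir(const char *const *dirs, size_t n,
                                    void (*fn)(const char *))
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (!dir_exists(dirs[i]))
            continue;   /* e.g. /lib64 is absent on some systems */
        if (fn)
            fn(dirs[i]);
        found++;
    }
    return found;
}
```

With this in place, the six mnt_rdonly() calls could be replaced by for_each_existing_dir(ro_dirs, sizeof(ro_dirs) / sizeof(ro_dirs[0]), mnt_rdonly).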

The program ends by starting a bash session.

execlp("/bin/bash", "/bin/bash", NULL);

Conclusion

The unshare() system call allows a process to disassociate parts of its execution context that are currently shared with other processes. The call can be used to build chroot-style environments inside applications, and the extra source code required is usually minimal.

The idea of including chroot support inside the application has been around for ages; examples include OpenSSH, vsftpd and ProFTPD.

systemd allows the user to define and attach a chroot jail to any service it starts and manages. It supports /tmp directory isolation (PrivateTmp), a private networking stack (PrivateNetwork), and limited directory access (ReadWriteDirectories, ReadOnlyDirectories, InaccessibleDirectories). The feature is implemented using the unshare() system call.

More chroot/unshare() examples can be found in this article. One implementation of these ideas is the pam_namespace PAM module (man 8 pam_namespace). The module sets up a private mount namespace for a session with polyinstantiated directories. A polyinstantiated directory provides a different instance of itself depending on the user name.
