With all the talk about Debian choosing a default init system, I’ve decided to share with the world a little project I’ve been working on to help me understand /sbin/init, a.k.a. PID 1.

In this blog post I will go step by step showing what an init system must do to be functional. I will ignore all the legacy SysVinit stuff, and technical nuances, and just concentrate on what’s really important.

Introduction

First of all, what is ‘init’? In essence, it’s a process that must be running at all times: if it ever exits, the kernel panics, and after that there is nothing you can do except reboot.

This process doesn’t need to do anything special, you can use /bin/sh as your init, or even /bin/yes (although the latter wouldn’t be very useful).

So let’s write our very first init.

    #!/usr/bin/ruby

    Process.spawn('agetty', 'tty1')
    sleep

Believe it or not, this is actually a rather useful init. How useful it is depends on how your kernel was compiled, your partitioning scheme, and if your root file-system is mounted rw or not. But either way, it covers the basics: rule #1; always keep running no matter what.

This is almost true, except that we need to be listening for SIGCHLD, otherwise some processes wouldn’t be cleaned up properly, so:

    Signal.trap(:SIGCHLD) do
      loop do
        begin
          status = Process.wait(-1, Process::WNOHANG)
          break if status == nil
        rescue Errno::ECHILD
          break
        end
      end
    end
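The same reaping loop can be pulled out into a helper that is easy to try outside of PID 1. This is only a sketch: the name reap_children and the returned list of PIDs are assumptions made for illustration, not part of the original init.

```ruby
# Collect every child that has already exited, without blocking.
# Returns the PIDs that were reaped (possibly an empty list).
def reap_children
  reaped = []
  loop do
    pid = Process.wait(-1, Process::WNOHANG)
    break if pid.nil?        # children exist, but none has exited yet
    reaped << pid
  end
  reaped
rescue Errno::ECHILD
  reaped                     # there are no children left at all
end
```

The loop matters because several children can exit before the handler runs, and signals of the same kind are not queued, so one SIGCHLD may stand for many exits.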

Reboot

Now that we have running indefinitely under control, it’s time to stop running (only when requested). To do that we need some kind of IPC with the running process. There are many ways to achieve this; I chose UNIX sockets.

So instead of sleeping forever, we listen for commands issued to /run/initctl :

    require 'socket'

    begin
      server = UNIXServer.open('/run/initctl')
    rescue Errno::EADDRINUSE
      # A stale socket was left behind by a previous run
      File.delete('/run/initctl')
      retry
    end

    loop do
      ctl = server.accept
      cmd = ctl.readline.chomp.to_sym
      # do stuff
    end

And when the user is calling us with arguments, we pass those commands through /run/initctl .

    def do_cmd(*cmd)
      ctl = UNIXSocket.open('/run/initctl')
      ctl.puts(cmd.join(' '))
      puts(ctl.readline.chomp)
      exit
    end

    case ARGV[0]
    when 'poweroff', 'restart', 'halt'
      do_cmd(ARGV[0].to_sym)
    end
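To see the whole round trip without touching /run/initctl, the two halves can be exercised against a socket in a temporary directory. This is a minimal sketch: serve_once, send_cmd and the "ok <command>" reply format are hypothetical names and conventions chosen for the example.

```ruby
require 'socket'

# Server side: accept one connection, parse one line, send one reply.
# Returns the command as a symbol plus its arguments.
def serve_once(server)
  ctl = server.accept
  cmd, *args = ctl.readline.chomp.split(' ')
  ctl.puts("ok #{cmd}")      # reply format is an assumption of this sketch
  [cmd.to_sym, args]
ensure
  ctl.close if ctl
end

# Client side: send one command line and return the one-line reply.
def send_cmd(path, *cmd)
  ctl = UNIXSocket.new(path)
  ctl.puts(cmd.join(' '))
  ctl.readline.chomp
ensure
  ctl.close if ctl
end
```

A line-based protocol keeps both ends trivial: one line in, one line out, and the connection is closed afterwards.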

So we can issue the command init poweroff to turn off the machine, but for that to actually happen, init has to tell the kernel:

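    def sys_reboot(cmd)
      # Command codes understood by the reboot system call
      map = { poweroff: 0x4321fedc, restart: 0x01234567, halt: 0xcdef0123 }
      # 169 is the reboot syscall number on x86-64; the next two are magic values
      syscall(169, 0xfee1dead, 537993216, map[cmd])
    end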

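These numbers are not important in themselves (they are the magic values and command codes of the reboot system call, and 169 is its number on x86-64 only); what is important is that the kernel understands them, and with this we actually turn off the machine (or halt, or reboot).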
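For readers who prefer names to magic numbers, the same call can be spelled with the constants from the kernel's linux/reboot.h header. This sketch only builds the argument list and does not invoke the system call; reboot_args and SYS_REBOOT_X86_64 are names made up for the example.

```ruby
# Names follow <linux/reboot.h>; values are the ones used in the post.
LINUX_REBOOT_MAGIC1  = 0xfee1dead
LINUX_REBOOT_MAGIC2C = 0x20112000   # the same value as 537993216

REBOOT_CMD = {
  poweroff: 0x4321fedc,   # LINUX_REBOOT_CMD_POWER_OFF
  restart:  0x01234567,   # LINUX_REBOOT_CMD_RESTART
  halt:     0xcdef0123,   # LINUX_REBOOT_CMD_HALT
}.freeze

SYS_REBOOT_X86_64 = 169   # hypothetical name; the number differs per architecture

# Build the argument list for Kernel#syscall, rejecting unknown commands
# instead of passing nil to the kernel.
def reboot_args(cmd)
  code = REBOOT_CMD.fetch(cmd) { raise ArgumentError, "unknown command: #{cmd}" }
  [SYS_REBOOT_X86_64, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2C, code]
end
```

Passing the result to syscall(*reboot_args(:poweroff)) as PID 1 would power the machine off, so the sketch deliberately stops short of that.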

Tread carefully

Obviously it would be tedious to type a bunch of commands each time the machine starts, so init needs to actually do stuff after booting. However, if we do something wrong, we might render the system unusable. A simple way to solve this is to put the work in scripts, fork a shell, and let it run them: if there’s something wrong with the scripts, the shell dies, but not PID 1, so the system remains usable, which again is rule #1.

Fortunately Ruby has exceptions, so we can run code with a safety net that catches all exceptions, and there’s no need to fork, which would waste precious booting time.

    def action(name)
      print(name)
      begin
        yield
      rescue => e
        print(' (error: %s)' % e)
      end
      puts
    end

With this helper, we can safely run chunks of code, and if they fail, the error is reported to the user.
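A slightly extended variant makes the behaviour easy to check by hand. The out parameter and the boolean return value are additions for this sketch, not something the original helper has. Note that a bare rescue catches StandardError and its subclasses only, so things like SystemExit or a signal still propagate.

```ruby
# Run a block under a label; report any StandardError instead of dying.
# Returns true when the block succeeded and false when it raised.
def action(name, out = $stdout)
  out.print(name)
  ok = true
  begin
    yield
  rescue => e               # StandardError and subclasses only
    out.print(' (error: %s)' % e)
    ok = false
  end
  out.puts
  ok
end
```

Calling action('Setting hostname') { raise 'boom' } would print "Setting hostname (error: boom)" and return false, while init itself keeps running.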

Initialization

This is the bulk of the code; the instructions you don’t want to type every time. This is mostly tedious stuff, you can skim or skip this section safely.

    def mount(type, device, dir, opts)
      Dir.mkdir(dir) unless File.directory?(dir)
      system('mount', '-t', type, device, dir, '-o', opts)
    end

    action 'Mounting virtual file-systems' do
      mount('proc', 'proc', '/proc', 'nosuid,noexec,nodev')
      mount('sysfs', 'sys', '/sys', 'nosuid,noexec,nodev')
      mount('tmpfs', 'run', '/run', 'mode=0755,nosuid,nodev')
      mount('devtmpfs', 'dev', '/dev', 'mode=0755,nosuid')
      mount('devpts', 'devpts', '/dev/pts', 'mode=0620,gid=5,nosuid,noexec')
      mount('tmpfs', 'shm', '/dev/shm', 'mode=1777,nosuid,nodev')
    end

And set the hostname.

    action 'Setting hostname' do
      hostname = File.read('/etc/hostname').chomp
      File.write('/proc/sys/kernel/hostname', hostname)
    end

Notice that many things can go wrong; for example, the file ‘/etc/hostname’ might not exist. That would only raise an exception, which the action helper catches and reports, and our init would continue just fine.

On shutdown, another thing we want to do is kill all the processes, otherwise we might not be able to unmount the file-systems. We could run killall5, but we wouldn’t have much control over the processes, and that would require a fork. Instead we can ask the kernel to signal every process at once (kill with PID -1), and all we have to do is wait for the results.

    def killall
      def allgone?()
        Dir.glob('/proc/*').each do |e|
          pid = File.basename(e).to_i
          begin
            # Skip non-process entries and init itself
            next if pid < 2
            # Is it a kernel process?
            next if File.read('/proc/%i/cmdline' % pid).empty?
          rescue Errno::ENOENT
            # The process exited while we were looking at it
            next
          end
          return false
        end
        return true
      end

      def wait_until(timeout = 2, interval = 0.25)
        start = Time.now
        begin
          break true if yield
          sleep(interval)
        end while (Time.now - start) < timeout
      end

      ok = false
      action 'Sending SIGTERM to processes' do
        Process.kill(:SIGTERM, -1)
        ok = wait_until(10) { allgone? }
        raise 'Failed' unless ok
      end

      return if ok

      action 'Sending SIGKILL to processes' do
        Process.kill(:SIGKILL, -1)
        ok = wait_until(15) { allgone? }
        raise 'Failed' unless ok
      end
    end
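The polling helper is useful on its own. Below is a standalone sketch of the same idea that measures the timeout with the monotonic clock instead of Time.now. That swap is a suggestion, not what the code above does: during boot and shutdown the wall clock may be adjusted, and a monotonic clock is not affected by such jumps. It also returns an explicit false on timeout.

```ruby
# Poll a condition until it holds or the timeout expires.
# Returns true as soon as the block is truthy, false on timeout.
def wait_until(timeout = 2, interval = 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep(interval)
  end
end
```

Usage stays the same as before, for example wait_until(10) { allgone? }.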

Time to mount real file-systems:

    NETFS = %w[nfs nfs4 smbfs cifs codafs ncpfs shfs fuse fuseblk glusterfs davfs fuse.glusterfs]
    VIRTFS = %w[proc sysfs tmpfs devtmpfs devpts]

    action 'Mounting local filesystems' do
      except = NETFS.map { |e| 'no' + e }.join(',')
      system('mount', '-a', '-t', except, '-O', 'no_netdev')
    end

    # On shutdown
    action 'Unmounting real filesystems' do
      except = (NETFS + VIRTFS).map { |e| 'no' + e }.join(',')
      system('umount', '-a', '-t', except, '-O', 'no_netdev')
    end

If you are using a modern distribution, chances are your /run and /tmp directories are cleared up on every boot, so many files and directories need to be re-created. We could do this by hand, but we could also use the systemd-tmpfiles utility which uses the configuration already provided by your distribution in tmpfiles.d directories.

    action 'Manage temporary files' do
      system('systemd-tmpfiles', '--create', '--remove', '--clean')
    end

    begin
      File.delete('/run/nologin')
    rescue Errno::ENOENT
    end

Unless you are using a custom kernel with modules built-in, chances are you are going to need udev, so fire it up:

    action 'Starting udev daemon' do
      system('/usr/lib/systemd/systemd-udevd', '--daemon')
    end

    action 'Triggering udev uevents' do
      system('udevadm', 'trigger', '--action=add', '--type=subsystems')
      system('udevadm', 'trigger', '--action=add', '--type=devices')
    end

    action 'Waiting for udev uevents to be processed' do
      system('udevadm', 'settle')
    end

    # On shutdown
    action 'Shutting down udev' do
      system('udevadm', 'control', '--exit')
    end

Finally

After all this initialization stuff, your system is most likely very usable already, and in fact I was able to start a display manager (SLiM) at this point, which was my main goal while writing this. But we are just getting started.

In control

Another thing init should do is keep track of launched daemons. Each time we launch one we store its PID, and when the child exits, we remove it from the list.

    def start(id, cmd)
      $daemons[id] = Process.spawn(*cmd)
    end

    start('agetty1', %w[agetty tty1])

    # On SIGCHLD
    key = $daemons.key(status)
    $daemons.delete(key) if key

Once we have this it’s trivial to report their status (e.g. init status agetty1).

    ctl.puts($daemons[args.first] ? 'ok' : 'dead')
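Put together, the bookkeeping fits in a small class. This is a sketch only: the class name DaemonTable and its methods are hypothetical, and the real init keeps the table in the $daemons global instead.

```ruby
# Track launched daemons by id, and forget them once they exit.
class DaemonTable
  def initialize
    @pids = {}
  end

  # Launch a command and remember its PID under the given id.
  def start(id, cmd)
    @pids[id] = Process.spawn(*cmd)
  end

  # Reap exited children (non-blocking) and drop them from the table.
  def reap
    loop do
      pid = Process.wait(-1, Process::WNOHANG)
      break if pid.nil?
      id = @pids.key(pid)
      @pids.delete(id) if id
    end
  rescue Errno::ECHILD
    nil                      # no children at all
  end

  # 'ok' while the daemon is still tracked, 'dead' otherwise.
  def status(id)
    @pids[id] ? 'ok' : 'dead'
  end
end
```

In init itself, reap would be what the SIGCHLD handler calls, and status would answer the init status request.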

At this point we actually have a feature that SysVinit doesn’t have. Not bad for 200 lines of code!

cgroups

cgroups is a feature that is often misunderstood, probably because there are no good tools to make use of them, but they are not that hard. Lennart Poettering went to a lot of trouble trying to explain exactly what systemd does with them and what it does not, but I don’t think he did a very good job of clarifying anything. Normally systemd is not doing anything with them (by default) beyond labeling processes, so you can see how they are grouped by using visualization tools like systemd-cgls, but that’s it.

The single most important way you can take advantage of cgroups is for scheduling purposes. For example, if your web browser is in one control group and your heavy compilation is in another, the Linux scheduler will keep the two from stealing resources from each other, without the need to adjust the nice level. Basically, with cgroups there’s no need for nice (although you can use both together).

But you don’t have to lift a finger to get this benefit; the kernel already does that if you have CONFIG_SCHED_AUTOGROUP, which you should. With it, a group is created for each session in the system. If you don’t know what sessions are, you can run ‘ps f -eo pid,sid,cmd’ to find out which session ID each process belongs to.
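The per-session grouping can be approximated from Ruby by reading /proc directly. This is a rough sketch and not the script mentioned below: parse_stat_session and sessions are hypothetical helpers, and they group by session ID (the sixth field of /proc/PID/stat) rather than asking the kernel for its autogroups.

```ruby
# Extract the session ID from the contents of /proc/<pid>/stat.
# The command name sits in parentheses and may itself contain spaces or
# parentheses, so the remaining fields are parsed after the last ')'.
def parse_stat_session(stat_line)
  rest = stat_line[(stat_line.rindex(')') + 1)..-1].split
  rest[3].to_i               # rest = state, ppid, pgrp, session, ...
end

# Map each session ID to the list of PIDs that belong to it.
def sessions
  groups = Hash.new { |hash, sid| hash[sid] = [] }
  Dir.glob('/proc/[0-9]*/stat').each do |path|
    begin
      line = File.read(path)
    rescue Errno::ENOENT, Errno::ESRCH
      next                   # the process went away in the meantime
    end
    groups[parse_stat_session(line)] << line.to_i
  end
  groups
end
```

Printing sessions.each { |sid, pids| puts "#{sid}: #{pids.join(' ')}" } gives a crude version of the listings shown next.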

To prove this I wrote a little script that finds out the auto-grouping as reported by the Linux kernel, and you can find groups like:

    ------------------------------------------------------------------------------
      503 slim -nodaemon
      895 /bin/sh /etc/xdg/xfce4/xinitrc -- /etc/X11/xinit/xserverrc
      901 dbus-launch --sh-syntax --exit-with-session
      938 xfce4-session
      948 xfwm4
      952 xfce4-panel
      954 Thunar --daemon
      956 xfdesktop
      958 conky -q
      964 nm-applet
    ------------------------------------------------------------------------------

This is exactly what you would expect, the session leader (SLiM) starts a bunch of processes, and all of them belong to the same session, and if I compile a Linux kernel, I get:

    ------------------------------------------------------------------------------
    14584 zsh
    17920 make
    20610 make -f scripts/Makefile.build obj=arch/x86
    20661 make -f scripts/Makefile.build obj=kernel
    20715 make -f scripts/Makefile.build obj=mm
    20734 make -f scripts/Makefile.build obj=arch/x86/kernel
    20736 make -f scripts/Makefile.build obj=fs
    20750 make -f scripts/Makefile.build obj=arch/x86/kvm
    20758 make -f scripts/Makefile.build obj=arch/x86/mm
    21245 make -f scripts/Makefile.build obj=ipc
    21274 make -f scripts/Makefile.build obj=security
    21281 make -f scripts/Makefile.build obj=security/keys
    21376 /bin/sh -c set -e; echo ' CC mm/mmu_context.o'; ...
    21378 gcc -Wp,-MD,mm/.mmu_context.o.d ...
    21387 /bin/sh -c set -e; echo ' CC ipc/msg.o'; ...
    21390 gcc -Wp,-MD,ipc/.msg.o.d ...
    21395 /bin/sh -c set -e; echo ' CC kernel/extable.o'; ...
    21399 /bin/sh -c set -e; echo ' CC [M] arch/x86/kvm/pmu.o'; ...
    21400 gcc -Wp,-MD,kernel/.extable.o.d ...
    21403 gcc -Wp,-MD,arch/x86/kvm/.pmu.o.d .
    21405 /bin/sh -c set -e; echo ' CC arch/x86/kernel/probe_roms.o'; ...
    21407 gcc -Wp,-MD,arch/x86/kernel/.probe_roms.o.d ...
    21413 /bin/sh -c set -e; echo ' CC fs/inode.o'; ...
    21415 /bin/sh -c set -e; echo ' CC arch/x86/mm/srat.o'; ...
    21418 /bin/sh -c set -e; echo ' CC security/keys/keyctl.o'; ...
    ------------------------------------------------------------------------------

This group will contain a lot of processes that take a lot of resources, but the scheduler knows they belong to the same group. If somebody logs in to my machine and starts running folding@home we would have two cgroups trying to use 100% of the CPU, so the scheduler would assign 50% to one, and 50% to the other, even though the first one has many more processes. Without the grouping, the scheduler would be unfair against folding@home, giving it as much time as it gives each one of the compilation processes.

All this without you lifting a finger. Well, almost: groups are per session, so init has to start each daemon in a session of its own.

    def start(id, cmd)
      pid = fork do
        Process.setsid()
        exec(*cmd)
      end
      $daemons[id] = pid
    end
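The effect of setsid is easy to verify: once the child has called it, the child's session ID equals its own PID. A small sketch, with spawn_in_new_session as a hypothetical stand-in for the start method above:

```ruby
# Fork, make the child a session leader, then exec the command in it.
# Returns the child's PID.
def spawn_in_new_session(cmd)
  fork do
    Process.setsid           # new session; the child becomes its leader
    exec(*cmd)
  end
end
```

After pid = spawn_in_new_session(%w[sleep 5]) and a brief moment for the child to run, Process.getsid(pid) should return pid itself rather than the session of the parent.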

Socket activation

systemd has made a lot of fuss about socket activation, and how it’s the best thing since sliced bread. I agree it’s a great idea, but the idea didn’t come from systemd; AFAIK it came from OS X. But do we need systemd to get the same in Linux?

    def start_with_socket(id, stream, cmd)
      server = TCPServer.new(stream)
      Thread.new do
        loop do
          socket = server.accept
          system(*cmd, :in => socket, :out => socket)
        end
      end
    end

    start_with_socket('sshd', 22, %w[/usr/bin/sshd -i])

Believe it or not, this simple code achieves socket activation. We create a socket and a new thread that waits for connections. If nobody connects, nothing happens: we just have an idle thread. Each time somebody connects, we launch sshd -i, which as far as I can tell is the same thing xinetd and systemd do.
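The same pattern can be tried without root or sshd by binding to the loopback address and using cat as the "daemon", which turns it into an echo service. This is a sketch under those assumptions. It differs from the code above in three ways: it takes no id, it returns the server and the thread, and it closes the parent's copy of the connection once the command exits so that the client sees end-of-file.

```ruby
require 'socket'

# Listen on a local TCP port and run the command once per connection,
# with the connection as its stdin and stdout (inetd style).
def start_with_socket(port, cmd)
  server = TCPServer.new('127.0.0.1', port)
  thread = Thread.new do
    loop do
      socket = server.accept
      begin
        system(*cmd, in: socket, out: socket)
      ensure
        socket.close         # drop our copy so the peer gets EOF
      end
    end
  end
  [server, thread]
end
```

With server, thread = start_with_socket(0, %w[cat]), the kernel picks a free port, which can be read back from server.addr[1]; anything a client sends is echoed back until it closes its side.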

But hey, this is the simple socket activation, it’s not the really fancy one.

    Thread.new do
      if managed
        IO.select([server])
        pid = fork do
          env = {}
          env['LISTEN_PID'] = $$.to_s
          env['LISTEN_FDS'] = 1.to_s
          Process.setsid()
          exec(env, *cmd, 3 => server)
        end
        $daemons[id] = pid
      else
        loop do
          socket = server.accept
          system(*cmd, :in => socket, :out => socket)
        end
      end
    end

There, this does exactly the same thing as systemd (at least for one socket, multiple ones are easy too), so yeah, we have socket activation.
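On the receiving end, a daemon finds the inherited socket through the two environment variables set above: passed descriptors start at number 3, and LISTEN_PID must match the daemon's own PID. A minimal sketch of that check follows; listen_fds and its parameters are hypothetical names, and a real daemon would more likely use sd_listen_fds from libsystemd.

```ruby
SD_LISTEN_FDS_START = 3      # first passed descriptor, as in the exec call above

# Return the descriptor numbers handed to this process, or an empty list
# when the variables are absent or were meant for another process.
def listen_fds(env = ENV, pid = Process.pid)
  return [] unless env['LISTEN_PID'].to_i == pid
  count = env['LISTEN_FDS'].to_i
  (SD_LISTEN_FDS_START...(SD_LISTEN_FDS_START + count)).to_a
end
```

A Ruby daemon could then wrap the first descriptor with TCPServer.for_fd(listen_fds.first) and call accept on it as usual.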

But wait, there’s more

Hopefully this covers the basics of what an init system should do, and how it’s not rocket science, nor voodoo. It is actually something very straightforward: start the system, keep it running, simple. Of course there are many other things an operating system should do, but those things don’t belong in the init system; don’t let anyone tell you otherwise.

I have more changes on top of this that bring my little toy init system almost on par with Arch Linux’s initscripts, which is what they used before moving to systemd, so chances are that if you use my init, you will have little to no problems on your own system.

Unlike systemd and others, this code is actually very readable, so you can add and remove code as you like very easily, and of course, the less code you have, the faster you boot.

Personally when I hear somebody saying “Oh! but OpenRC doesn’t have socket activation, we need systemd!”, I just roll my eyes.

If you want to give it a try, get the code from GitHub:

https://github.com/felipec/finit

Cheers.