[systemd-devel] [HEADSUP] cgroup changes

Heya,

On Monday I posted this mail:

http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html

Here's an update and a bit on the bigger picture:

Half of what I mentioned there is now in place. There's now a new "slice" unit type in git, and everything is hooked up to it. logind will now also keep track of running containers/VMs. The various container/VM managers have to register with logind now. This serves the purpose of better integration of containers/VMs everywhere (so that "ps" can show for each process where it belongs). However, the main reason for this is that this is eventually going to be the only way containers/VMs can get a cgroup of their own.

So, in that context, a bit of the bigger picture:

It took us a while to realize the full extent of how awfully unusable cgroups currently are. The attributes have way more interdependencies than people might think, and it is trivial to create nonsensical configurations... Of course, understanding how awful the status quo is makes a good first step. But we really needed to figure out what we can do to clean this up in the long run, and how we can get to something useful quickly.

So, after much discussion between Tejun (the kernel cgroup maintainer) and various other folks, here's the new scheme that we want to go for:

1) In the long run there's only going to be a single kernel cgroup hierarchy; the per-controller hierarchies will go away. The single hierarchy will allow controllers to be individually enabled for each cgroup. The net effect is that the hierarchies the controllers see are not orthogonal anymore; they are always subtrees of the full single hierarchy.

2) This hierarchy becomes private property of systemd. systemd will set it up. systemd will maintain it. systemd will rearrange it. Other software that wants to make use of cgroups can do so only through systemd's APIs.
This single-writer logic is absolutely necessary, since the interdependencies between the various controllers, the various attributes and the various cgroups are non-obvious, and we simply cannot allow cgroup users to alter the tree independently of each other forever. Due to all this: the "Pax Cgroup" document is a thing of the past; it is dead.

3) systemd will almost entirely hide the fact that cgroups are used internally. In fact, we will take away the unit configuration options ControlGroup=, ControlGroupModify=, ControlGroupPersistent=, ControlGroupAttribute= in their entirety. The high-level options CPUShares=, MemoryLimit=, ... will continue to exist and we'll add additional ones like them. The system.conf setting DefaultControllers=cpu will go away too. Basically, you'll get more high-level settings, but all the low-level bits will go away without replacement. We will take away the ability for the admin to set arbitrary low-level attributes, to arrange things in completely arbitrary cgroup trees or to enable arbitrary controllers for a service.

4) systemd git introduced a new unit type called "slice" (see above). This is for partitioning up the resources of the system into slices. Slices are hierarchical, and other units (such as services, but also containers/VMs and logged-in users) can then be assigned to these slices. Slices internally map to cgroups, but they are a very high-level construct. Slices will expose the same CPUShares=, MemoryLimit= properties as the other units do. This means resource management will become a first-class, built-in functionality of systemd. You can create slices for your customers, and in them subslices for their departments, and then run services, users and VMs in them. In the long run these will even be dynamically movable (while they are running), but that'll take more kernel work.
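
To make the customer/department example a bit more concrete, a setup along those lines might look roughly like the sketch below. All unit names (customer1.slice, customer1-sales.slice, webapp.service), the binary path and the numbers are made up for illustration. The Slice= setting and the convention that a dash in a slice name nests it below its parent are assumptions that are not spelled out in this mail, so take this as a sketch rather than as final syntax.

```ini
# customer1.slice (hypothetical name): one slice per customer
[Slice]
CPUShares=2048
MemoryLimit=4G

# customer1-sales.slice (hypothetical name): a subslice for one department,
# assumed to be nested below customer1.slice because of the dash in its name
[Slice]
MemoryLimit=1G

# webapp.service (hypothetical name): a service assigned to the subslice
[Service]
ExecStart=/usr/bin/webapp
# Slice= is assumed here as the way to place a unit into a slice
Slice=customer1-sales.slice
```

Note that only high-level settings show up here; no cgroup path, controller name or low-level attribute appears anywhere in these files.
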
By default there will be three slices: "system.slice" (where all system services are located by default), "user.slice" (where all logged-in users are located by default) and "machine.slice" (where all running VMs/containers are located by default). However, the admin will have full freedom to create arbitrary slices and then move the other units into them.

5) systemd's logind daemon already kept track of logged-in users/sessions. It is now extended to also keep track of virtual machines/containers. In fact, this is how libvirt/nspawn and friends will now get their own cgroups. They register as a machine, which means passing a bit of meta info to systemd, and getting a cgroup assigned in response. This registration ensures that "ps" and friends can show which VM/container a process belongs to, and it also easily allows other tools to query container/VM info, so that in the long run we'll be able to provide a level of container/VM integration like Solaris zones do.

So, all of this together sounds like an awful lot of change. #1 and #2 are long-term changes. However, #3, #4, #5 are something we can do now and should do now, as preparation for the single-writer, unified cgroup tree. We really, really shouldn't keep shipping the cgroup mess any longer, with people making use of the current systemd APIs that expose way too many internal guts, stuff that we *know* right now is broken and will cease to exist. We don't want to expose low-level details we already know *now* we cannot support for long.

Even though #3, #4, #5 sound like major work, they are not. In fact #4 and #5 are already fully implemented on the systemd side upstream. I am working on #3. I am confident that I'll have this finished in a few days too, since this is really just about deleting code more than writing code. With #3, #4, #5 we have something in place that should do the basic things and first and foremost will hide all the lower-level details of cgroups.
This has the big benefit of allowing us to rearrange these details later without having to break the user or programming interfaces, and that's what I really care about here.

Now, what does this mean for other projects using cgroups?

So basically, since we won't implement #1 + #2 immediately, the cgroup tree stays relatively open for other cgroup users. They can continue to fiddle with it for now, but it must be clear that this is temporary, and that they shouldn't attempt anything too fancy. Direct access to the cgroup tree is on its way out and that must be clear to everybody.

More specifically: libcgroup is out of the game with this. libvirt/openshift/lxc/.. can continue to do what they do for now; however, they should be updated sooner rather than later to do things the systemd way, i.e. rely on systemd VM/container registration and user cgroup management.

And to make one last thing clear: this time, it's not Kay and me who are taking away the cgroup tree from everybody else, it's actually all Tejun's fault as the kernel cgroup maintainer... ;-) He wants a unified, single-writer hierarchy, and it took us a while to agree to that, but we're now fully on the same page with him.

If you are using non-trivial cgroup setups with systemd right now, then things will change for you. We will provide you with similar functionality as before, but things will be different and less low-level. As long as you only used the high-level options such as CPUShares=, MemoryLimit= and so on, you should be on the safe side.

I hope this makes some sense,

Lennart

--
Lennart Poettering - Red Hat, Inc.