Notes on how systemd's Job engine is structured

Introduction

systemd (PID1 and its associated auxiliary daemons) is the default initialization system on the majority of GNU/Linux systems. This document mostly revolves around PID1 and how it works, and documents some of the internal mechanisms that I have found to leak out of the Unit abstraction (certainly not something you can fully reason about without delving into the inner mechanisms of PID1, hence the claim). Understanding these mechanisms is key to making sense of certain aspects of systemd's behavior, and of its debug output when troubleshooting the system.

Familiarity with systemd is a prerequisite. This commentary does not go into whether systemd does the right thing, nor into whether using it is the right choice. However, I do point out in various places the problems I observe, because however often it is compared to prior work in this domain (daemontools, sysvinit+initscripts, nosh, s6, runit, SMF, launchd), it is in essence an object system that creates a network of dependencies between objects and uses that network to drive the execution of processes.

It (PID1) encompasses the roles of cron, incron/inotify{wait,watch}, atd, the BSD automounter, inetd, UCSPI, and any other event source of choice (which you may plug into it through the bus interface, queueing jobs on some event - though you do lose some properties, like exclusive tracking of job state, and gain some races as part of the asynchronicity of the interface). It unifies all these event sources under one execution engine, adds a powerful and complex graph resolution framework that can be used to pull in other jobs, checks the entire transaction for consistency before it is merged and executed, and uses state changes in jobs as a hook into this propagation framework to trigger other jobs (e.g. OnFailure=). As such, in the words of the authors themselves, it is:

A dependency network between objects, usable for propagation, combined with a powerful execution engine is basically what systemd is.

This however means that apart from the dependency engine and propagation framework, the event sources, and the code that manages the execution context of a service (residing in PID1 - the binary - but that code path is only hit in the child process it forks off as part of every executable job), it also contains the logic for cgroup resource control. It uses cgroups as the basic process management unit, in essence, even though they are not a Job API and resource control comes only as a side effect of choosing them. The idea however has merit, and the kernel API in its next iteration, cgroupsv2, seems to be evolving into one: control over many facets of process scheduling and even metrics (see the PSI patchset), the freezer being queued up for inclusion, BPF-based hooks to control their behavior, access control - all hint towards this.

Units, Jobs, and Transactions - The holy trinity

systemd organises system resources as units, and all of these units are part of the dependency graph. You are expected to mostly deal with these units when operating the system, without caring about the semantics of the underlying mechanism - but that is seldom how it works out. Not all of them deal with processes, some cannot be controlled via systemd (device units), and each has varying semantics (targets automatically gain otherwise-orthogonal After= ordering on their Requires= dependencies, and each type has a set of implicit dependencies which are non-overridable, alongside default dependencies (DefaultDependencies=) which are). Slices don't use such dependency types but implicitly order themselves against the parent slice (with Slice= being the way other units configure that), a consequence of exposing cgroup hierarchies as units.

More on units later, but let's understand what the Manager object does in systemd. The Manager object (visible on the bus as well) is at the centre of it all, and also handles transactions (more on them later too). It has a queue from which units are loaded from disk (but only those which are part of the graph; systemd fakes "loading" of unreferenced units queried through `systemctl status` by loading them on request and then immediately dropping them from memory). Loading converts unit files into the internal representation (flat at the first level, a giant set of key-value pairs), as does generating one programmatically (what you see in `systemctl show`). There are bookkeeping tasks that mark units as invalid - when they enter a failed or inactive state (see CollectMode=) and hold no jobs (there is only one installed job per unit) - and that garbage collect them (the strategy configurable through CollectMode=). These happen in two separate queues of the Manager (the run_queue and the gc_queue). The Manager object is also what exposes the operations to enqueue jobs on the bus (maintaining its own dbus_queue).
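As a toy model of this bookkeeping, consider the sketch below. The names loosely mirror the source (load_queue, gc_queue, one job slot per unit), but the logic and all identifiers are mine, purely for illustration:

```python
from collections import deque

# Hypothetical, much-simplified model of PID1's Manager queues:
# a load_queue for units to be parsed, and a gc_queue for units that
# hold no job and are no longer referenced by the graph.
class Unit:
    def __init__(self, name, referenced):
        self.name = name
        self.referenced = referenced
        self.loaded = False
        self.job = None          # at most one installed job per unit

class Manager:
    def __init__(self):
        self.units = {}
        self.load_queue = deque()
        self.gc_queue = deque()

    def add_unit(self, name, referenced=True):
        u = self.units[name] = Unit(name, referenced)
        self.load_queue.append(u)
        return u

    def dispatch_load_queue(self):
        # parse the unit file into the internal key/value form (elided),
        # then queue unreferenced, jobless units for collection
        while self.load_queue:
            u = self.load_queue.popleft()
            u.loaded = True
            if not u.referenced and u.job is None:
                self.gc_queue.append(u)

    def dispatch_gc_queue(self):
        # drop units nothing holds on to (strategy tunable via CollectMode=)
        while self.gc_queue:
            u = self.gc_queue.popleft()
            if not u.referenced and u.job is None:
                del self.units[u.name]

m = Manager()
m.add_unit("a.service")
m.add_unit("ephemeral.service", referenced=False)
m.dispatch_load_queue()
m.dispatch_gc_queue()
```

After the two dispatch passes, only the referenced unit survives, mimicking the "load on request, drop immediately" behavior described above.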

Jobs encapsulate the execution of a unit (not in the sense of a process, however). There are various job types (JOB_START, JOB_STOP, JOB_VERIFY_ACTIVE, JOB_RESTART, JOB_RELOAD, JOB_NOP, JOB_RELOAD_OR_START) and job modes (exposed through systemctl; see man systemctl and read about --job-mode=). The mode decides how the jobs being queued affect jobs already in the waiting or running state - i.e. the other jobs already queued and the dependencies implied against them - which determines how the job in question is treated when jobs are merged together or collapsed (the job type tuned based on the state of the unit), and which heuristics are used for merging and garbage collection when they are part of a transaction (in relation to what the unit's state might be).

JOB_VERIFY_ACTIVE, as its name suggests, serves as a way of determining whether the unit is already active; in that case the queued job is skipped, with the job result JOB_SKIPPED (see enum JobResult in src/core/job.h for others). An interesting observation is that there is no distinction when a condition fails: whether the job completed successfully or was skipped because one of the Conditions was violated cannot easily be distinguished from the result (both return JOB_DONE). JOB_TIMEOUT as a result usually means the job failed to report readiness in time, however readiness is defined for the unit type.

Job merging is described briefly in src/core/job.c, in that there is a matrix with rules for how jobs will be merged:

Merging is associative! A merged with B, and then merged with C is the same as A merged with the result of B merged with C. Mergeability is transitive! If A can be merged with B and B with C then A also with C. Also, if A merged with B cannot be merged with C, then either A or B cannot be merged with C either.

This merging simply means they will be coalesced in the same transaction. There are other conditions they need to satisfy for this to happen without conflicts (though the Manager may reorder jobs as it wishes if that can make the transaction succeed), and job types can collapse into other job types as need be (for those that depend on active unit state, like JOB_RELOAD_OR_START, JOB_TRY_RELOAD, and JOB_TRY_RESTART). In particular, the transaction building and consistency checks for mergeability are orthogonal to jobs being installed, and merging rules differ for installed jobs: conflicting jobs get cancelled - not deleted, because one is allowed to hold references to them on the bus; jobs that are merely waiting are safe to merge into (the exceptional case being JOB_RELOAD); and jobs that are already running produce more inconsistency - some may be safe to merge into and some may not, hence the job is marked as merged (job_merge_into_installed) but run again. There is a note in the source saying that queueing it after the installed job's completion might be the more sensible choice, but due to the limitation that there can be only one installed job per unit, this mechanism has to be used instead. It was not clear from my reading why there can be only one job per unit (perhaps because it also influences what the state of the unit turns out to be), so this remains unanswered.
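The merge matrix and its claimed transitivity can be sketched with a toy subset. The entries below are a hedged reconstruction for illustration, not the authoritative table in src/core/job.c:

```python
# Toy subset of the job-type merge matrix. None means "not mergeable"
# (e.g. STOP conflicts with START). Merging is commutative, hence the
# frozenset keys.
RULES = {
    frozenset({"START", "VERIFY_ACTIVE"}): "START",
    frozenset({"START", "RELOAD"}): "RELOAD_OR_START",
    frozenset({"START", "RESTART"}): "RESTART",
    frozenset({"VERIFY_ACTIVE", "RELOAD"}): "RELOAD",
    frozenset({"VERIFY_ACTIVE", "RESTART"}): "RESTART",
    frozenset({"RELOAD", "RESTART"}): "RESTART",
}

def merge(a, b):
    if a == b:
        return a
    return RULES.get(frozenset({a, b}))   # None == conflict

# "Mergeability is transitive": if A merges with B and B with C,
# then A must merge with C as well. Verify it for this subset:
TYPES = ["START", "VERIFY_ACTIVE", "STOP", "RELOAD", "RESTART"]
for a in TYPES:
    for b in TYPES:
        for c in TYPES:
            if merge(a, b) and merge(b, c):
                assert merge(a, c) is not None
```

Note how STOP merges with nothing but itself: it partitions the types into two mergeable groups, which is what makes the transitivity property hold so easily here.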

There are some other types I did not list before, like JOB_TRY_RESTART, JOB_RELOAD_OR_START and JOB_TRY_RELOAD, which are special in the sense that their tuning depends on the state of the unit, and they immediately undergo collapsing (as the intention is to comply with the client's request of trying one of these combined operations *at request time*). Each of them collapses to its underlying job type when the unit is running; when it is not, the TRY_* variants collapse to JOB_NOP (there is nothing to reload or restart), while JOB_RELOAD_OR_START collapses to JOB_START. This should remind some of messages like "Transaction is destructive", which is simply caused when some job type other than JOB_NOP itself merges with JOB_NOP. (There are more reasons why a transaction might be destructive: the fail job mode can cause conflicting jobs to render the transaction destructive, as can irreversible jobs in the queue.) Naturally, JOB_NOP will cause no other jobs to be pulled into the transaction it is part of, and it takes a special slot in the Unit object (u->nop_job) so transaction builders can drop it later. (See job.c: pj = (j->type == JOB_NOP) ? &j->unit->nop_job : &j->unit->job;).
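The collapsing rule is small enough to state as code. This is a sketch of the behavior of job_type_collapse() as I read it, with string stand-ins for the enum values:

```python
# Hedged sketch of job type collapsing: the collapsed type depends only
# on the unit's active state at request time.
def collapse(job_type, unit_is_active):
    if job_type == "TRY_RESTART":
        return "RESTART" if unit_is_active else "NOP"
    if job_type == "TRY_RELOAD":
        return "RELOAD" if unit_is_active else "NOP"
    if job_type == "RELOAD_OR_START":
        return "RELOAD" if unit_is_active else "START"
    return job_type   # already a collapsed ("basic") type
```

So a `systemctl try-restart` against an inactive unit installs a JOB_NOP, while the same request against an active unit installs a plain JOB_RESTART.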

While building a transaction, each job has a subject Job list and an object Job list, which describe which Job is requesting (the subject) and what it needs (the object). All dependencies of a job are recursively added until all requirements and relationships are satisfied, including the positive, negative, and inverse ones (WantedBy= et al., which become positive for the referenced unit after serialization), the propagative ones (in the case of JOB_RELOAD, ReloadPropagatedFrom= and friends are also included), and the event-based ones (BindsTo=). Again, an anchor job means the subject becomes NULL (as seen in the source), and the functions that accept it treat it accordingly. Each job has an unsigned generation integer and a job marker (see Job* marker and unsigned generation in struct Job). There is also a GC marker which is used in mark-and-sweep GC logic for the jobs that are part of a certain transaction.

enum JobMode in job.h lists the various job modes and their effects. It is interesting to note that JOB_MODE_DEPENDENCIES and JOB_MODE_REQUIREMENTS are two distinct job modes, which reiterates the point made before: requirement dependencies and ordering clauses are totally orthogonal. Reading Davin McCall's commentary on dependencies, he makes the point that ordering and requirement shouldn't be entirely orthogonal. Perhaps this is true, but it is very specifically tied to the implementation of the dependency engine in question. There are cases in systemd where a unit can be activated as part of a transaction and one might need proper ordering semantics but not requirement semantics, i.e., "only order us if we're already part of the transaction, and not otherwise". This is a useful bit most dependency-based schemes ignore. Job modes like isolate are often exposed through systemctl isolate and described as causing all units other than the one requested to be stopped; in reality, if there are already running jobs, the iterator treats units with an OnFailure= action exceptionally (as cancelling those would result in more jobs being triggered, the job result being that of a failure). The others are mostly obvious.

There is an anchor job in a transaction: explicitly what was asked for originally, before that job underwent merges and collapsing. A pointer to the anchor job lives alongside the transaction's hashmap of jobs, with a boolean per job telling whether the said job is irreversible (irreversible jobs shall only be cancelled explicitly; nothing else can cause their removal or cancellation on conflicts, i.e. by jobs that get pulled in by others - one cannot even enqueue oneself after them while they are still waiting). When traversing the graph recursively to find and delete jobs that cause ordering cycles, the generation integer is compared and the marker is checked to be non-NULL to see if the node has already been visited in this walk (which would indicate a cycle), at which point certain heuristics are applied to choose which job to drop from the transaction to restore its consistency. In particular, it begins walking backwards and checks whether any of the jobs upwards matter to the anchor job (transaction_find_jobs_that_matter_to_anchor in transaction.c), choosing to drop them from the transaction if not. This operation is done recursively until the jobs of the transaction can be merged. When two jobs do not merge with a third one, it is decided which of the two to drop: start jobs are favored over stop jobs, except if there is already a conflict caused by another unit that is stopping. Mergeability is then checked iteratively, over and over, until the cycles are broken. This appears to have been implemented so that the job being queued loses as few of the jobs it requires as possible while the graph becomes acyclic again.
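The cycle detection itself can be sketched as a depth-first walk. systemd uses a per-job generation counter and marker pointer to the same effect; the conventional white/gray/black formulation below is mine, not a transcription of transaction_verify_order_one():

```python
# Find one ordering cycle in an "After=" edge map, returning the jobs
# that form it (so a heuristic could then pick one to drop), or None.
def find_ordering_cycle(after):
    GRAY, BLACK = 1, 2        # GRAY: on the current walk path
    color, path = {}, []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in after.get(node, ()):
            if color.get(dep) == GRAY:
                return path[path.index(dep):]   # walked into ourselves
            if color.get(dep) != BLACK:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK   # fully explored, provably cycle-free
        return None

    for n in list(after):
        if color.get(n) is None:
            cycle = visit(n)
            if cycle:
                return cycle
    return None

# a ordered After= b, b After= c, c After= a: an ordering cycle.
assert find_ordering_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}) == ["a", "b", "c"]
```

Returning the whole path segment is what makes the drop heuristics possible: given the cycle's members, one can then prefer deleting a job that does not matter to the anchor.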

In a transaction, based on what is configured as the CollectMode= (collecting on the inactive state, or on both inactive and failed), jobs will be garbage collected. Enqueuing a single job is fairly straightforward (when it has no dependencies): the job is retrieved from the hash table if available or a new one is allocated, setting its generation integer and Job marker (see transaction_add_one_job).

Redundant job types are dropped: job_type_is_redundant returns true for JOB_NOP (we already discussed why that makes sense), and for other job types it depends on whether the unit is already in the state the job would bring it to. This is taken care of by the manager as part of transaction building.

One already knows there are many unit types, and documentation is plentiful, so we will not dive into those; however, one should note how automount/mount differ just by lazy initialization, and mount/swap just by which binary they encapsulate (from util-linux: mount or swapon). Slices, as already pointed out, are an exception when it comes to expressing dependencies (in that they implicitly gain dependencies on the parent slice). Targets do not treat dependency requirements and ordering clauses as orthogonal by default. All of this certainly does not contribute to a unifying mental model, since there are too many inconsistencies. Socket units are badly named (because they can activate not only on all types of sockets but also on named pipes, special files, USB GadgetFS descriptors, etc.). TimeoutStopSec= is not available in every unit type, and TimeoutSec= configures both the start and stop timeout, which is undesirable in mount units.

Units encapsulate jobs, in that a job affects what the Unit's state is and will be, and how one can change it. Running units are internally organized behind vtables (UNIT_VTABLE) for operations that require dynamic dispatch on the unit type (start/stop/kill etc.), and the Manager uses the dispatch table to trigger these for propagation purposes. Depending on the unit type, a certain operation might not be supported - the poster boy being device units, which support none of start/stop/restart, as they envelope the state of the udev subsystem; as such they don't go through unit state transitions (stopping -> stopped) and will not trigger relationships that cause other jobs to be generated (like PartOf= - hence the extensive use of BindsTo=, which is rather event based). There is clearly a distinction between a Unit as what is serialized, the running instance, and references to Units (which can be tuned to point to the same unit again).
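The vtable idea can be sketched as follows. The slot names echo UNIT_VTABLE, but the model (None slots for unsupported operations, string return values) is purely illustrative:

```python
# Per-type dispatch behind a vtable: unit types fill in only the
# operations they actually support.
class UnitVTable:
    def __init__(self, start=None, stop=None, reload=None):
        self.start, self.stop, self.reload = start, stop, reload

# services implement start/stop; device units leave every slot empty,
# since their state merely mirrors the udev subsystem
SERVICE_VTABLE = UnitVTable(
    start=lambda name: f"forking off {name}",
    stop=lambda name: f"killing {name}",
)
DEVICE_VTABLE = UnitVTable()

def unit_start(vtable, name):
    # the manager refuses the operation when the slot is unimplemented
    if vtable.start is None:
        raise NotImplementedError(f"{name}: start not supported")
    return vtable.start(name)
```

So a start job against a device unit fails up front at dispatch, rather than by running and reporting a bad result, which is consistent with devices never making the stopping -> stopped transition.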

Dependencies don't really mean what you thought

Dependencies in systemd (as registered under [Unit]) encapsulate all facets of how a unit may behave: events, relationships, ordering, and their propagation effects. However, this produces a lot of cognitive inconsistency for users, and the interactions are generally subtle and not well understood. There are requirement types for expressing a dependency on some other unit, ordering clauses which are orthogonal (but not in targets!), and other propagation types, which are either asymmetric (PartOf= but no ConsistsOf=, its reverse property, which currently exists only as an automatic, internally marked dependency - so expressing such a requirement means changing every unit with PartOf= instead of just one with ConsistsOf=) or ad-hoc (PropagatesReloadTo= and ReloadPropagatedFrom=, but no equivalents for start/stop/restart, which are job types distinct from JOB_RELOAD, yet still under [Unit]); there are RefuseManualStart= and RefuseManualStop=, but no RefuseManualRestart= (which, under systemd, is a distinct operation, as file descriptors aren't flushed because JOB_RESTART is turned into a JOB_START after JOB_STOP) or RefuseManualReload=. This is a consequence of systemd putting dependency requirements, propagators, ordering, and positive/negative relationships all under the broad umbrella of "dependencies" from the user's point of view, and hence unifying them under the Unit object (and hence under [Unit]). However, it is a mix of many unrelated operations from a literal dependency point of view, and includes even more miscellany (like OnFailure=) and event-based clauses (BindsTo=) which internally are dependencies, and RequiresMountsFor=, which even takes a special slot (recursively adding dependencies for a mount point).

For dependencies, see https://bl33pbl0p.github.io/systemd_dependency

Understanding jobs and transactions is very important. For instance, when one adds Requires=B.service to A.service and does not add After=, the effect is that while both B and A will be part of the same transaction, A may finish starting before B fails to start (as no After= has been specified), or A may fail before B succeeds (mitigated by the use of Before= in B, or After= in A). This interaction produces two possible outcomes instead of one - A can come up successfully with B's failure never becoming fatal to it - which was not the intended result, all of it depending on runtime state.
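A minimal unit-file illustration of the point (unit names and the binary path are hypothetical):

```ini
# A.service (fragment)
[Unit]
Requires=B.service
# Requires= only pulls B into the transaction; without the following
# line, A's ExecStart= runs concurrently with B's start job, and A can
# succeed even if B later fails:
# After=B.service

[Service]
ExecStart=/usr/bin/a-daemon
```

Uncommenting the After= line makes A's start job wait for B's to complete, turning B's failure into a failed dependency for A.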

Unrelated stuff