Puppet is a popular configuration-management tool. Puppet’s basic model is declarative: you define a set of “resources” and the state they should be in. A “resource” can be basically anything that might be managed on a server: a file on disk, a user account, a provisioned database instance, a running service, …

Puppet compiles the input configuration into a “catalog” containing all the defined resources, and builds a dependency graph over them: e.g. before the MySQL service can be started, the MySQL package has to be installed.

Applying a puppet catalog involves walking the catalog in dependency order, analyzing each resource in turn and modifying the running system to reflect the desired state (creating or removing a user, starting or stopping a service, …).
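As a rough illustration of "walking the catalog in dependency order", here is a minimal Python sketch. It is not Puppet's actual code (Puppet is written in Ruby); the `catalog` dictionary and the `apply_order` helper are hypothetical names invented for this example, and the ordering is delegated to the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical catalog: each resource maps to the resources it depends on.
catalog = {
    "Package[mysql]": [],
    "File[/etc/mysql/my.cnf]": ["Package[mysql]"],
    "Service[mysql]": ["Package[mysql]", "File[/etc/mysql/my.cnf]"],
}


def apply_order(deps):
    """Return the resources in an order where every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


order = apply_order(catalog)
for resource in order:
    # A real agent would inspect and modify the system here.
    print("applying", resource)
```

The package is visited before the config file and the service, which mirrors the MySQL example above.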

As with most problems in engineering, puppet has to deal with the ever-present possibility of failures or errors: What happens if a resource node cannot be applied correctly? Permission errors, insufficient disk space, being asked to install a typoed package, …

If a resource fails, puppet records this fact and then continues applying the catalog, attempting to apply as much of the catalog as it can. Since it maintains a dependency graph, it can selectively skip only the resources that depend on a failed resource.

However, until recently, puppet implemented this skipping by visiting, for each node, every one of its recursive (transitive) dependencies and checking whether any of them had failed.

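It performed this check for every node, regardless of whether any failures had happened or not. This trivially leads to O(N²) behavior for a depth-N dependency chain: the node at depth k has k recursive dependencies to check, so the whole chain costs 0 + 1 + … + (N − 1) = N(N − 1)/2 checks!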
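The quadratic cost is easy to reproduce in a toy model. The sketch below is my own simplified Python rendering of the per-node check described above, not the original Ruby code; `count_checks_naive` and `linear_chain` are hypothetical helpers, and the set of failed resources is passed in up front rather than discovered while applying.

```python
def count_checks_naive(deps, order, failed):
    """For every node, walk all of its transitive dependencies.

    Returns the number of dependency checks performed and the set of
    nodes that would be skipped because some dependency failed.
    """
    checks = 0
    skipped = set()
    for node in order:
        seen = set()
        stack = list(deps[node])
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            checks += 1
            if dep in failed:
                skipped.add(node)
            stack.extend(deps[dep])
    return checks, skipped


def linear_chain(n):
    """Node i depends on node i - 1; node 0 has no dependencies."""
    return {i: ([i - 1] if i > 0 else []) for i in range(n)}


for n in (10, 100, 1000):
    checks, _ = count_checks_naive(linear_chain(n), range(n), failed=set())
    print(n, checks)  # n * (n - 1) // 2 checks, even with no failures
```

Even with an empty `failed` set, the number of checks grows as N(N − 1)/2 for a chain of N nodes.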

My fix, scheduled for release with Puppet 4.2, attaches a list of failed recursive-dependencies to each node. When visiting a node, the list is computed for that node by directly unioning the lists of the immediate dependencies.
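The idea can be sketched in the same toy model. Again, this is a simplified Python illustration under my own assumptions rather than the patch itself: `apply_with_failed_sets` is a hypothetical name, the `fails` set stands in for resources whose application would fail, and I assume a failed node records itself in its own set so that its dependents can see the failure.

```python
def apply_with_failed_sets(deps, order, fails):
    """Apply nodes in dependency order, carrying failed dependencies along.

    Each node's set of failed dependencies is the union of the sets of
    its immediate dependencies, so no recursive walk is needed.
    """
    failed_deps = {}
    status = {}
    unions = 0
    for node in order:
        blocked = set()
        for dep in deps[node]:
            blocked |= failed_deps[dep]
            unions += 1
        if blocked:
            status[node] = "skipped"
            failed_deps[node] = blocked
        elif node in fails:
            status[node] = "failed"
            failed_deps[node] = {node}  # assumption: a failure records itself
        else:
            status[node] = "applied"
            failed_deps[node] = blocked  # empty set
    return status, unions


deps = {"a": [], "b": ["a"], "c": ["b"], "d": []}
status, unions = apply_with_failed_sets(deps, ["a", "b", "c", "d"], fails={"a"})
print(status, unions)

# On a linear chain with no failures, there is one union per edge.
chain = {i: ([i - 1] if i > 0 else []) for i in range(1000)}
_, chain_unions = apply_with_failed_sets(chain, range(1000), fails=set())
print(chain_unions)
```

In the common case where nothing fails, every set is empty and the work is one cheap union per dependency edge, so a linear chain costs linear rather than quadratic time.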

To demonstrate the fix I constructed a series of puppet manifests that each just included N notify resources in a linear chain, and compared runtime before and after my patch:

[edited to add]: The above graph is plotted to an artificially large N to make the quadratic behavior extremely obvious to visual inspection; I don’t mean to imply that real manifests will have depth-6000 dependency trees. However, the patch is also a significant improvement on real-life manifests: As noted in the PR, it cut puppet runtime nearly in half on many real servers at Stripe.