The program consists of a non-blocking parent process whose main job, among other functions, is to supervise several child processes and spawn a replacement whenever one exits unexpectedly.

The parent process uses AnyEvent for its non-blocking functionality, and the code explicitly forces the EV backend (a wrapper around the libev library) instead of letting AnyEvent auto-detect an event loop at runtime. This is done so that the same event loop is always used across many codebases.
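How the program actually pins the backend is not shown in this post. One common approach, sketched here as an assumption, is to load EV before AnyEvent is first used, so that AnyEvent's detection finds EV already present, and then to check what was selected with AnyEvent::detect:

```perl
use strict;
use warnings;

use EV;          # loaded first, so AnyEvent's detection finds it already present
use AnyEvent;

# AnyEvent::detect forces detection now and returns the implementation class.
my $model = AnyEvent::detect;

# Fail loudly if something (e.g. PERL_ANYEVENT_MODEL) selected another backend.
die "expected the EV backend, got $model\n"
    unless $model eq 'AnyEvent::Impl::EV';

print "event loop backend: $model\n";
```

Checking the result at startup, rather than trusting detection, means a missing EV installation or an overriding environment variable shows up immediately instead of as a subtle behaviour difference later.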

Also relevant to the story: at startup, the parent process announces itself to what is, for the purposes of this post, a service discovery system. To do this, it uses Sys::Hostname to get the current host’s hostname.

A simple example of this functionality might look something like this:
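The listing below is a minimal sketch of such a supervisor rather than the production code. The names announce, spawn_child, main, $NUM_CHILDREN and %watchers are illustrative, the announcement is stubbed out as a log line, and the children only sleep in a loop:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use EV;                           # loaded first so AnyEvent uses the EV backend
use AnyEvent;
use Sys::Hostname qw(hostname);

$| = 1;                           # unbuffered output, so forks do not duplicate it

my $NUM_CHILDREN = 5;             # assumption: five children
my %watchers;                     # pid => AnyEvent child watcher

# Hypothetical stand-in for the real service discovery call:
# looks up the hostname and only logs the announcement.
sub announce {
    my $host = hostname();
    print "[$$] announcing myself to service discovery\n";
    return $host;
}

# Fork one child; in the parent, watch it and respawn when it exits.
sub spawn_child {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: placeholder for real work, never returns.
        print "[$$] child ready, randomly doing stuff\n";
        sleep(1 + int(rand(5))) while 1;
    }

    # Parent: the callback receives the pid and exit status of the dead child.
    $watchers{$pid} = AnyEvent->child(
        pid => $pid,
        cb  => sub {
            my ($dead_pid, $status) = @_;
            delete $watchers{$dead_pid};
            print "[$$] child $dead_pid died, respawning a new one...\n";
            spawn_child();
        },
    );
    return $pid;
}

sub main {
    announce();
    spawn_child() for 1 .. $NUM_CHILDREN;
    print "[$$] new parent here, looking after children for the rest of my life\n";
    AnyEvent->condvar->recv;      # block forever, dispatching events
}

main() unless caller;             # run only when executed as a script
```

Each child gets its own watcher keyed by pid, so the parent learns about an exit through the event loop instead of polling or blocking in waitpid.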

When run, the output looks something like this:

$ perl basic-supervisor.pl
[92815] announcing myself to service discovery
[92816] child ready, randomly doing stuff
[92817] child ready, randomly doing stuff
[92818] child ready, randomly doing stuff
[92819] child ready, randomly doing stuff
[92820] child ready, randomly doing stuff
[92815] new parent here, looking after children for the rest of my life

Now, if one of the child processes is killed from another terminal:

$ kill 92816

The supervisor will fork a new child to replace it:

[92815] child 92816 died, respawning a new one...
[92822] child ready, randomly doing stuff

Great! It does exactly as expected! One of the child processes was killed, the parent process was notified via the event loop, and it replaced the dead process with a new one.

This code was developed and was due to be deployed onto a host running Perl 5.8.9 on CentOS 5.

The Problem

Part-way through development, this particular system was replatformed onto a new provider and upgraded to CentOS 6 and Perl 5.20.1.

So I redeployed everything onto a new development host based on the new production servers’ architecture… and the process management functionality completely stopped working. Any time a child process exited unexpectedly, the parent never noticed, so a new child process was never forked to replace it.