What’s going on here?

I will repeat my usual message one more time, because I still think it’s useful: Don’t read below this paragraph without previously tried to understand the issue on your own. Go and try it, I’ll be right here…

OK, now that you tried on your own, let’s see if you found the same thing that Marc discovered…

Restart Intensity

Let’s talk a little bit about supervision configuration first. In particular I would like to show you this paragraph from Erlang docs:

To prevent a supervisor from getting into an infinite loop of child process terminations and restarts, a maximum restart intensity is defined using two integer values specified with keys intensity and period in the [sup_flags()] map. Assuming the values MaxR for intensity and MaxT for period, then, if more than MaxR restarts occur within MaxT seconds, the supervisor terminates all child processes and then itself. intensity defaults to 1 and period defaults to 5.

Let’s check our animals_sup module:

As you can see, we used 5 for intensity and 1 for period. That means that if 5 restarts happen during a single second, the supervisor itself will die. We had 10 dodos, we killed them all. That clearly demands more than 5 restarts in a single second.

So, that’s why the dodos supervisor died! Problem solved! Right?

Not so fast… we had 7 cockroaches, too. We also killed them all. That also required more than 5 restarts in a second. Why was this supervisor restarted and the one for the dodos was not?

Restart Strategy

We determined that both supervisors (cockroaches and dodos) died, but only one was restarted. The process in charge of restarting them is world_sup. Let’s check it out:

Can you spot the difference?

Both supervisors are defined in the ChildSpecs part of the supervisor definition returned by init/1. The difference between them is in the restart field. While dodos are transient, cockroaches are permanent.

But, what does that mean? Let’s check Erlang docs again:

restart defines when a terminated child process must be restarted. A permanent child process is always restarted. A temporary child process is never restarted (even when the supervisor’s restart strategy is rest_for_one or one_for_all and a sibling’s death causes the temporary process to be terminated). A transient child process is restarted only if it terminates abnormally, that is, with another exit reason than normal, shutdown, or {shutdown,Term}.

Ok, then, since cockroaches is a permanent supervisor, it will always be restarted. That makes sense.

Now, what about the dodos? Well, that supervisor is terminated because it reaches it’s max restart strategy. That looks a lot like an abnormal termination. Nevertheless, it’s not restarted. Why?

SASL to the Rescue!

Remember I told you I added sasl as a dependency of my app. Well, sasl not only prints out progress reports, it also displays crash reports when the processes are terminated. I intentionally removed them from the console snippets above, but let’s see what they have to say now that we know what we’re looking for…

1> world:start().

... =PROGRESS REPORT==== 29-Oct-2016::01:30:30 ===

application: world

started_at: nonode@nohost 2> world:add_dodos(6).

ok

3> world:kill_dodos(). =SUPERVISOR REPORT==== 29-Oct-2016::01:32:19 ===

Supervisor: {local,dodos}

Context: child_terminated

Reason: killed ...[5 similar child_terminated messages]... [true,true,true,true,true,true] =SUPERVISOR REPORT==== 29-Oct-2016::01:32:19 ===

Supervisor: {local,dodos}

Context: shutdown

Reason: reached_max_restart_intensity

Offender: [{pid,<0.127.0>},

{id,animal},

{mfargs,{animal,start_link,[dodos]}},

{restart_type,permanent},

{shutdown,5000},

{child_type,worker}]

4>

We have 5 processes killed with reason killed. Then the 6th one is killed with reason reached_max_restart_intensity. But we have no report of how the dodos supervisor was killed. Considering the docs shown above, it seems like the reason for termination of the supervisor was normal, shutdown or {shutdown, Reason}. SASL shows no reports about it, because it terminated normally.

Let’s look at the logs for cockroaches to see if we can find more info…

4> world:add_cockroaches(6).

ok

5> world:kill_cockroaches().

[true,true,true,true,true,true] =SUPERVISOR REPORT==== 29-Oct-2016::01:48:41 ===

Supervisor: {local,cockroaches}

Context: child_terminated

Reason: killed ...[6 child_terminated messages]... =SUPERVISOR REPORT==== 29-Oct-2016::01:48:41 ===

Supervisor: {local,cockroaches}

Context: shutdown

Reason: reached_max_restart_intensity

Offender: [{pid,<0.149.0>},

{id,animal},

{mfargs,{animal,start_link,[cockroaches]}},

{restart_type,permanent},

{shutdown,5000},

{child_type,supervisor}] =SUPERVISOR REPORT==== 29-Oct-2016::01:48:41 ===

Supervisor: {local,world_sup}

Context: child_terminated

Reason: shutdown

Offender: [{pid,<0.125.0>},

{id,cockroaches_sup},

{mfargs,{animals_sup,start_link,[cockroaches]}},

{restart_type,permanent},

{shutdown,5000},

{child_type,supervisor}] ...

6>

In this case, since the cockroaches supervisor is permanent and that means it has to be restarted regardless of the reason why it’s terminated, we do have a report from sasl. That report lets us to see that the reason for cockroaches supervisor termination is, in fact, shutdown.

That may seem unexpected at first glance. Considering that one of the goals of the supervision tree is to be able to escalate issues so that higher levels in the app tree can better deal with problems in lower levels, I would’ve expected world_sup to restart dodos supervisor after dodos determined it was unable to deal with too many children restarts and killed itself.

The OTP team probably had a good reason for this implementation. I just don’t know it. So, if you know the reason, let me know in the comments below, and remember: