Fault Tolerance doesn't come out of the box - Alchemy 101: Part 3

2017-03-29 by Thomas Hutchinson


This is Part 3 in our Alchemy 101 series. Catch up on Part 1: Elixir Module Attributes and Part 2: From Elixir Mix Configuration to Release Configuration.

One of the biggest selling points of Elixir is the means it gives you to write fault-tolerant applications via its concurrency model. Processes can broadcast their failure to dependent processes, which can then take appropriate action. You decide how processes should respond to failure based on your use case; there is no single solution. I’ll give you an example in which not handling failure led to, you guessed it, more failure.

First install and start RabbitMQ.

Then create a new project.

    mix new rabbit --module Rabbit
    cd rabbit

Then add the amqp dependency to mix.exs.

    defp deps do
      [{:amqp, "~> 0.1.5"}]
    end

Now add the following to lib/rabbit.ex.

    defmodule Rabbit do
      use GenServer
      require Logger

      def start_link do
        GenServer.start_link(__MODULE__, nil)
      end

      def init(_) do
        {:ok, connection} = AMQP.Connection.open("amqp://localhost")
        {:ok, channel} = AMQP.Channel.open(connection)
        {:ok, channel}
      end

      def publish(pid, payload) do
        GenServer.cast(pid, {:publish, payload})
      end

      def handle_cast({:publish, payload}, channel) do
        AMQP.Basic.publish(channel, "", "", payload)
        Logger.info("Published #{payload}")
        {:noreply, channel}
      end
    end

That’s all you need. Fetch the dependencies and test Rabbit.

    mix deps.get
    iex -S mix
    iex(1)> {:ok, pid} = Rabbit.start_link()
    {:ok, #PID<0.139.0>}
    iex(2)> for i <- 1..3, do: Rabbit.publish(pid, "message #{i}")
    19:21:31.471 [info] Published message 1
    19:21:31.471 [info] Published message 2
    19:21:31.471 [info] Published message 3

Head to the default exchange in the RabbitMQ management UI and you should see some activity on the Message rates chart. Keep the above IEx shell open; you will need it again soon.

Time to introduce a problem. Restart RabbitMQ.

brew services restart rabbitmq

Go back to the IEx shell and notice the OTP error report. It looks serious; the main takeaway from it is below. The socket to RabbitMQ closed, causing a process to crash.

    19:22:57.994 [error] GenServer #PID<0.142.0> terminating
    ** (stop) :socket_closed_unexpectedly
    Last message: :socket_closed

Yet it still looks as if you can publish messages. Try it.

    iex(3)> for i <- 1..3, do: Rabbit.publish(pid, "message #{i}")
    19:21:31.471 [info] Published message 1
    19:21:31.471 [info] Published message 2
    19:21:31.471 [info] Published message 3

But if you head to the default exchange, no new messages appear to be coming in. Why?

First of all, we don’t see any failures in the Rabbit process because AMQP.Basic.publish/4 ultimately leads to :gen_server.cast/2 being called with channel.pid (see amqp_channel.erl in rabbit_common). :gen_server.cast/2 does not return or raise an error if the PID (in this case channel.pid) no longer exists, which is why the failure was hard to detect. Now imagine your application running in the background: this could have been even more difficult to spot.
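The silent behaviour of cast is easy to reproduce without RabbitMQ. The sketch below is only an illustration: the throwaway process and the {:publish, "lost message"} payload are made up for the demo and are not part of the amqp library.

```elixir
# Start a throwaway process that exits immediately, and wait until it is gone.
pid = spawn(fn -> :ok end)
ref = Process.monitor(pid)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
end

# The process is dead at this point.
alive? = Process.alive?(pid)

# Casting to the dead PID neither raises nor returns an error tuple.
# The message is simply dropped.
result = GenServer.cast(pid, {:publish, "lost message"})

IO.inspect({alive?, result})
```

Because the caller gets :ok either way, the only way to learn about the dead process is to be told about its exit, for example through a link or a monitor.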

Here comes the good part: how to handle the failure. We want Rabbit to be sent a message when the socket to RabbitMQ is closed. To do this we need to link to the connection process we started (to receive an exit signal if it stops) and trap exits, i.e. receive that exit signal as a message instead of crashing. Change init/1 and add a handle_info/2 clause as follows.

    def init(_) do
      {:ok, connection} = AMQP.Connection.open("amqp://localhost")
      # Turn exit signals from linked processes into {:EXIT, pid, reason} messages.
      Process.flag(:trap_exit, true)
      {:ok, channel} = AMQP.Channel.open(connection)
      # Link to the connection process so we are told when it dies.
      Process.link(connection.pid)
      {:ok, channel}
    end

    def handle_info({:EXIT, from, :socket_closed_unexpectedly}, channel) do
      Logger.warn("Received :EXIT from #{inspect(from)} for :socket_closed_unexpectedly")
      # A :stop tuple must carry the state as its third element.
      {:stop, :lost_rabbitmq_connection, channel}
    end
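The link-and-trap mechanism can also be tried in isolation, without a broker. The following is a minimal sketch under stated assumptions: Watcher, its worker/1 helper and the :crash and :worker_exited messages are hypothetical names for this demo, and the spawned process merely stands in for the AMQP connection process.

```elixir
defmodule Watcher do
  use GenServer

  def start_link(notify_pid), do: GenServer.start_link(__MODULE__, notify_pid)

  # Returns the PID of the linked stand-in process.
  def worker(pid), do: GenServer.call(pid, :worker)

  def init(notify_pid) do
    # Receive exit signals as messages instead of crashing.
    Process.flag(:trap_exit, true)

    # Stand-in for the connection: exits with the same reason once told to crash.
    worker =
      spawn_link(fn ->
        receive do
          :crash -> exit(:socket_closed_unexpectedly)
        end
      end)

    {:ok, %{worker: worker, notify: notify_pid}}
  end

  def handle_call(:worker, _from, state), do: {:reply, state.worker, state}

  # Only matches exits coming from the linked worker.
  def handle_info({:EXIT, from, reason}, %{worker: from} = state) do
    send(state.notify, {:worker_exited, from, reason})
    {:noreply, state}
  end
end

{:ok, watcher} = Watcher.start_link(self())
worker = Watcher.worker(watcher)
send(worker, :crash)

reason =
  receive do
    {:worker_exited, ^worker, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(reason)
```

Here the watcher survives the worker's exit and forwards the reason to the caller. Whether to keep running, reconnect or stop at that point is the design decision the rest of this article is about.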

Restart the IEx shell, start Rabbit again with Rabbit.start_link(), and then run the following.

brew services restart rabbitmq

Jump to the IEx shell and observe the output.

19:27:33.497 [warn] Received :EXIT from #PID<0.142.0> for :socket_closed_unexpectedly

Perfect, now when the socket is closed the dependent process, Rabbit, is informed. Now to respond to the failure. This is something you must decide; no one, including me and the developers of your libraries and frameworks, can tell you what to do. This is in your hands. It is also the reason why Elixir (and Erlang) have such a good reputation when it comes to building fault-tolerant systems. You decide how to respond to failure; there is no silver bullet. For the demo, though, I have chosen to simply stop the Rabbit GenServer when I receive an :EXIT for :socket_closed_unexpectedly.
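As one possible response among many, a process that stops on a lost connection could be run under a supervisor so that it is started again. The sketch below is an assumption-laden illustration, not what the demo above does: FlakyWorker and its :lose_connection cast are hypothetical and only simulate the stop. It relies on the child_spec/1 that use GenServer generates in Elixir 1.5 and later.

```elixir
defmodule FlakyWorker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def init(_), do: {:ok, nil}

  # Simulates stopping after the connection has been lost.
  def handle_cast(:lose_connection, state) do
    {:stop, :lost_rabbitmq_connection, state}
  end
end

{:ok, _sup} = Supervisor.start_link([FlakyWorker], strategy: :one_for_one)

old_pid = Process.whereis(FlakyWorker)
ref = Process.monitor(old_pid)
GenServer.cast(FlakyWorker, :lose_connection)

# Wait for the worker to go down and capture the reason it stopped with.
down_reason =
  receive do
    {:DOWN, ^ref, :process, ^old_pid, reason} -> reason
  after
    1_000 -> :timeout
  end

# The restart is asynchronous, so poll briefly for the new registered PID.
wait_for_restart = fn wait, attempts ->
  case Process.whereis(FlakyWorker) do
    pid when is_pid(pid) and pid != old_pid ->
      pid

    _ when attempts > 0 ->
      Process.sleep(10)
      wait.(wait, attempts - 1)

    _ ->
      nil
  end
end

new_pid = wait_for_restart.(wait_for_restart, 100)
IO.inspect({down_reason, old_pid, new_pid})
```

A restart only helps if init/1 can succeed again. With a broker that is still down, the restarted process would fail immediately, so retry limits or a backoff would need thought in a real application.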

Hope you enjoyed reading. Feel free to learn more about RabbitMQ and please give me your feedback.
