While working on my WorkerNet post, I stumbled across a weird behaviour with start_links, trap_exit and slave nodes.

Long Story (sorry, there is no short one)

As I was setting up a distributed test with slave nodes, I also wanted one gen_server to trap exits for the sake of its offspring, which I did not wish to put under a supervisor (shame on me ;). Suddenly, all of the tests stopped working! All of them were either timing out or reporting direct noprocs. Bewildered and wide-eyed at 23:40, I gave it a go with the dbg tracer and even went through some of the gen_server source.

No answer.

I chalked it up to the rpc calls to the remote nodes and tried printing out the process ids at each step. But no, it was a fact: my gen_servers died the instant they were created… Brooding over it, I tried some more but finally went to sleep. By then, I knew that the problem was caused by the following two snippets in combination with rpc calls to my local slave nodes:

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    init([]) ->
        process_flag(trap_exit, true),
        {ok, ok}.

The version without trap_exit, meanwhile, worked like a charm. Not wanting to waste more time on it, I just covered the problem up, like a cheap rug over a very dark, very deep and embarrassing hole in the floor, with

    start_link(succeed) ->
        {ok, Pid} = gen_server:start({local, ?MODULE}, ?MODULE, [], []),
        link(Pid),
        {ok, Pid}.

    init([]) ->
        process_flag(trap_exit, true),
        {ok, ok}.

But I couldn't leave it at just that. I had to seek help, so I showed it to my senior colleague Nicolas. By then I had devised a test which would reproduce this neatly. He cut it down a bit, and I boiled it down to the broth you see here, which you can compile and run for yourself.

Just for the record: the seemingly expected behaviour would be for the exit signals to show up in handle_info/2, without causing the process to die.

    %%% -------------------------------------------------------------------
    %%% @author Gianfranco <zenon@zen.local>
    %%% @copyright (C) 2011, Gianfranco
    %%% Created : 17 Jan 2011 by Gianfranco <zenon@zen.local>
    %%% -------------------------------------------------------------------
    -module(test).

    %% API
    -export([start_link/1]).
    -export([test/1, init/1, handle_info/2, terminate/2]).

    -spec(test(fail | succeed) -> term()).
    test(Mode) ->
        io:format("Current 0 ~p~n", [self()]),
        spawn(fun() ->
                      io:format("Current 1 ~p~n", [self()]),
                      {ok, _P} = ?MODULE:start_link(Mode)
              end).

    start_link(fail) ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []);
    start_link(succeed) ->
        {ok, Pid} = gen_server:start({local, ?MODULE}, ?MODULE, [], []),
        link(Pid),
        {ok, Pid}.

    init([]) ->
        process_flag(trap_exit, true),
        {ok, ok}.

    handle_info(timeout, State) ->
        {stop, normal, State};
    handle_info(_Info, State) ->
        io:format("info ~p~n", [_Info]),
        {noreply, State, 5000}.

    terminate(_Reason, _State) ->
        io:format("reason ~p~n", [_Reason]),
        ok.

Compiling and running it, we see both the expected and the unexpected. I chose to call the two modes fail and succeed, based on whether the process dies (fail) or succeeds in trapping the exit (succeed).

    zen:Downloads zenon$ erlc test.erl
    zen:Downloads zenon$ erl
    Erlang R14B (erts-5.8.1) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

    Eshell V5.8.1  (abort with ^G)
    1> test:test(fail).
    Current 0 <0.31.0>
    Current 1 <0.33.0>
    <0.33.0>
    reason normal
    2> test:test(succeed).
    Current 0 <0.31.0>
    Current 1 <0.36.0>
    <0.36.0>
    info {'EXIT',<0.36.0>,normal}
    (5 seconds later)
    reason normal
    3>

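As you see, in the succeed case the process did not die after initialization. It trapped the spawner's exit and saw it in handle_info/2, while in the fail case the same exit went straight to terminate/2. One possible explanation is the one stated in the module gen_server.erl (read the source, Luke!)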

    %%% ---------------------------------------------------
    %%%
    %%% The idea behind THIS server is that the user module
    %%% provides (different) functions to handle different
    %%% kind of inputs.
    %%% If the Parent process terminates the Module:terminate/2
    %%% function is called.
    %%%
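That comment only talks about the parent. As far as I can tell, an exit from any other linked process is delivered to handle_info/2 as an ordinary message, and only the parent's exit is turned into a call to terminate/2. Here is a minimal sketch of that distinction. The module name exit_demo, the function run/0 and the message tags are made up for this illustration and are not part of the original test.

```erlang
-module(exit_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Hypothetical demo: start a trapping gen_server with start_link/3,
%% then let first a non-parent and then the parent exit normally.
run() ->
    Reporter = self(),
    %% The starter is the parent. It stays alive until told to stop.
    Starter = spawn(fun() ->
                            {ok, Pid} = gen_server:start_link(?MODULE, Reporter, []),
                            Reporter ! {server, Pid},
                            receive stop -> ok end
                    end),
    Server = receive {server, P} -> P after 1000 -> exit(no_server) end,
    %% A process that is linked but is not the parent exits normally.
    Helper = spawn(fun() -> link(Server) end),
    Info = receive {handle_info, M} -> M after 1000 -> timeout end,
    %% Now the parent exits normally.
    Starter ! stop,
    Term = receive {terminate, R} -> R after 1000 -> timeout end,
    {Info =:= {'EXIT', Helper, normal}, Term}.

init(Reporter) ->
    process_flag(trap_exit, true),
    {ok, Reporter}.

handle_call(_Request, _From, Reporter) ->
    {reply, ok, Reporter}.

handle_cast(_Msg, Reporter) ->
    {noreply, Reporter}.

%% Exits from non-parent links arrive here as plain messages.
handle_info(Msg, Reporter) ->
    Reporter ! {handle_info, Msg},
    {noreply, Reporter}.

%% The parent's exit ends up here, with the parent's exit reason.
terminate(Reason, Reporter) ->
    Reporter ! {terminate, Reason},
    ok.
```

Calling exit_demo:run() from the shell should give {true, normal}: the helper's exit was seen by handle_info/2, and the starter's exit was seen by terminate/2.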

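After some more digging, Nicolas came up with the idea of calling sys:get_status/1 on the processes. What was revealed can be seen below: the parent of a process started with gen_server:start/4 is the process itself! The spawner is therefore just another linked process, not the parent, so its exit ends up in handle_info/2.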

    sys:get_status(<0.37.0>) =
        {status,<0.37.0>,
         {module,gen_server},
         [[{'$ancestors',[<0.36.0>]},
           {'$initial_call',{test,init,1}}],
          running,<0.37.0>,[],
          [{header,"Status for generic server test"},
           {data,[{"Status",running},
                  {"Parent",<0.37.0>},
                  {"Logged events",[]}]},
           {data,[{"State",ok}]}]]}
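The same check can be scripted instead of read off by eye. This is a minimal sketch, assuming the status tuple keeps the shape shown above, with the parent as the third element of the inner list. The names parent_demo, parent_of/1 and run/0 are hypothetical and only serve this illustration.

```erlang
-module(parent_demo).
-behaviour(gen_server).

-export([run/0, parent_of/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Picks the Parent element out of the sys:get_status/1 result.
parent_of(Pid) ->
    {status, Pid, {module, _Mod}, [_PDict, _SysState, Parent, _Dbg, _Misc]} =
        sys:get_status(Pid),
    Parent.

%% Compares the recorded parent for start/3 and start_link/3.
run() ->
    {ok, Unlinked} = gen_server:start(?MODULE, [], []),
    {ok, Linked} = gen_server:start_link(?MODULE, [], []),
    Result = {parent_of(Unlinked) =:= Unlinked,
              parent_of(Linked) =:= self()},
    %% Stop both servers with reason normal so the caller is unaffected.
    ok = gen_server:call(Unlinked, stop),
    ok = gen_server:call(Linked, stop),
    Result.

init([]) ->
    {ok, ok}.

handle_call(stop, _From, State) ->
    {stop, normal, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.
```

Calling parent_demo:run() should give {true, true}: the server started with start/3 lists itself as parent, while the one started with start_link/3 lists the caller.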

/G