In June 2018 users started reporting slack-side disconnects in their Ruby bots, reported via slack-ruby-client#208, regardless of the concurrency library (Eventmachine or Celluloid) used. I was already observing this behavior in my larger production bots, especially playplay.io with hundreds of teams.

Because this was a new behavior I assumed that the client was not seeing or handling some new type of communication error correctly at the websocket library layer. This has been an issue before, and in the case of Celluloid, IO exceptions had to be handled by emitting a :close event here. That is, the websocket library was seeing a problem and raising an exception, but the client didn’t properly emit a close event, so the bots just sat there waiting for something to happen.

def run_loop loop { read } rescue EOFError , Errno :: ECONNRESET , Errno :: EPIPE => e driver . emit ( :close , WebSocket :: Driver :: CloseEvent . new ( 1001 , 'server closed connection' )) end

The Faye::WebSocket::Client implementation has not seen a similar problem as it deferred reading data to WebSocket::Driver::StreamReader , which seemed to bubble up disconnects properly until these reports.

The server-side disconnects had started happening without any code changes, so naturally I assumed this was new behavior at Slack and opened a ticket on June 13. Slack quickly looped in @aoberoi from the SDKs team to help, but he found nothing wrong on the Slack side.

Bandaids

To relieve the immediate problem I stuck a restart into my bots (eg. slack-strava@bc29324) and began seeing about a dozen of these server-side disconnects every hour.

def cron once_and_every 60 * 60 do ping_teams! end end def ping_teams! Team . active . each do | team | begin ping = team . ping! next if ping [ :presence ]. online logger . warn "DOWN: #{ team } " after 60 do ping = team . ping! unless ping [ :presence ]. online logger . info "RESTART: #{ team } " SlackStrava :: Service . instance . start! ( team ) end end rescue StandardError => e logger . warn "Error pinging team #{ team } , #{ e . message } ." end end end

Gross, but effective.

Async Implementation

Because both the asynchronous library and the websocket implementation were used by thousands of projects, I naturally assumed slack-ruby-client would be more likely to have a bug. The celluloid-io code that I wrote was not detecting server-side disconnects occasionally, therefore we must have not been using celluloid-io correctly in concurrency/celluloid.rb. I summarized my conclusion in celluloid-io#187 and asked for help.

It turned out that Celluloid::IO was effectively unmaintained and I was told to switch to async. While that was a viable idea, replacing the concurrency library could easily create many new bugs without fixing the underlying issue. I wanted to understand the root cause, first.

Websocket Ruby Driver

Interestingly, both Celluloid and EventMachine used websocket-ruby-driver and, since both concurrency libraries exhibited a similar problem, I opened websocket-driver-ruby#61 and asked for ideas. I got a really helpful response from @ioquatix, the author of async.

I didn’t realize the socket read could return nil , meaning something like a disconnect. A read of length 0 returned nil . You also needed to handle empty strings, which may have meant that the client was disconnected. That’s EPIPE , EOFError , ECONNRESET , nil and "" .

Adding the code to handle all of the above in slack-ruby-client#216 solved nothing.

What else could go wrong?

There was some speculation that the bug was in Celluloid::IO::SSLSocket . It wasn’t.

Async-Websocket

Having no good ideas I ventured into replacing EventMachine and Celluloid with Async in slack-ruby-client#219 with a lot of help. That solved nothing, as expected, but replaced an unsupported library with a supported one.

Websocket-Level Ping Monitoring

We naturally reached the obvious conclusion that there was no bug and that the protocol required an active heartbeat, and that there was none implemented.

@RodneyU215 from Slack stepped in and added ping monitoring in slack-ruby-client#226 following Slack’s RTM guidelines (see Ping and Pong). The entire discussion and the evolution of that code is worth a read. I notably learned that Ruby Time.now was not monotonic.

I found the final code in slack-ruby-client#226 to be a rather elegant and straightforward example of a complex asynchronous process: a worker sends ping messages to Slack, which responds with a pong captured by the listening process. We periodically check whether a ping has been received within a reasonable timeframe and forcibly abort the connection, causing a reconnect, if it wasn’t.

We further refactored the implementation in slack-ruby-client#262, extracting and improving the keep-alive logic. Finally, slack-ruby-client#279 addressed handling of errors in the ping worker itself preventing it from dying.

Released Libraries

The active ping monitoring was released in slack-ruby-client 0.14.1, slack-ruby-bot 0.12.0 and slack-ruby-bot-server 0.9.0. Subsequent releases all the way up to 0.14.3 continue addressing various newly identified issues.

Thanks

Again, much thanks to Slack’s team for debugging, support and code, including Rodney Urquhart, Ankur Oberoi and Jason Roche, as well as Samuel Williams for his Async library, his adult supervision and for catching tons of non-obvious hard bugs across multiple code reviews.