We released the current version of Scaledrone over two years ago, in fall 2014. We were ramping up, trying to make a name for ourselves, and got some sizeable customers on board. The Node.js backends were not performing as well as we had hoped, so we kept adding servers to handle the load and eventually rewrote the backends in Go.

Sometimes bugs happen, and sometimes they end up in production. Occasionally the mistake was in our own code, but quite often it was in an external library we used. These bugs caused the backend processes to crash and restart (one by one, not all at once, of course), and all clients were reconnected to other socket servers. Those servers are effectively stateless, so this was not a big deal from that perspective. Users normally didn't notice, but for us as a service provider it certainly wasn't good enough. We had to find a way to discover fatal issues quickly and fix them before they happened again.

The simple yet reliable solution we have used ever since came to life in Slack, where we hold most of our internal communication: why not send the most business-critical events to a Slack channel where we'd be notified immediately?

After some research, we found that other people had already had a similar idea and had written scripts for that exact purpose. Some examples found on GitHub are:

The solution

We decided to use the slacktee.sh script because it was the most similar to the tee command. To start using it, you need to get a webhook URL for your Slack channel, put it into the slacktee.sh script, and place the script in a safe location on the server, for example /usr/local/bin/slacktee.

The tee and slacktee commands work by getting their input from a standard stream and passing it along for other commands to consume. This is done with the pipe character ( | ) and looks like this:

echo "I have a message" | tee -a messages.txt

The echo command passes the message "I have a message" to the pipe, from where the tee command picks it up and appends it to the messages.txt file (the -a flag appends rather than overwrites). The tee command also passes the message along, but as there are no pipes after it, the message is just written to the standard output. The quotes delimit the string so that it is not mixed up with other commands; they do not end up in the output.
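To see both halves of tee's behaviour at once, here is a minimal sketch that adds one more stage after it. The scratch directory and the tr stage are assumptions made for illustration; they are not part of our setup.

```shell
# Use a scratch directory so the example leaves nothing behind.
workdir=$(mktemp -d)

# tee appends the message to messages.txt AND forwards it down the pipe,
# where tr (standing in for "any other command") upper-cases it.
forwarded=$(echo "I have a message" | tee -a "$workdir/messages.txt" | tr '[:lower:]' '[:upper:]')
saved=$(cat "$workdir/messages.txt")

echo "forwarded down the pipe: $forwarded"
echo "saved in the file:       $saved"

rm -rf "$workdir"
```

The file keeps the original text while the next command in the pipe receives its own copy, which is exactly the behaviour slacktee mimics with a Slack channel in place of the file.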

The other piece of the puzzle is to configure the software that runs the process we want to monitor. Such software is usually called a process manager, and one of the most popular ones currently is systemd.

An example of a systemd configuration which runs a server process and uses slacktee to notify about it crashing looks like this:

[Unit]
Description=ScaleDrone Server
After=syslog.target network.target

[Service]
Type=simple
User=app
Group=app
WorkingDirectory=/opt/server
# The message is in double quotes so that sh expands $(hostname)
ExecStartPre=-/bin/sh -c 'echo "Server starting on $(hostname)" | /usr/local/bin/slacktee'
ExecStart=server --environment=production
ExecStop=/bin/kill -HUP $MAINPID
ExecStopPost=-/bin/sh -c '{ echo "Server stopped on $(hostname), last lines from logs:"; /usr/bin/tail -n 22 /var/log/server.log; } | /usr/local/bin/slacktee'
# Redirect logs to syslog where they are written to the /var/log/server.log file
StandardOutput=syslog
StandardError=syslog
# If it gets stopped, restart it immediately
Restart=always

[Install]
# multi-user.target corresponds to run level 3,
# roughly meaning wanted by system start
WantedBy=multi-user.target

Systemd runs the command specified after ExecStart whenever it is told to start the service, either manually or when the operating system boots. ExecStop is run when systemd is told to stop the process, for example when the system is shutting down.

The lines we are interested in start with ExecStartPre and ExecStopPost. ExecStartPre defines a command that runs before the ExecStart command. ExecStopPost runs after the process has stopped, even when it stopped on its own (e.g. crashed).

The commands after ExecStartPre and ExecStopPost might be a little hard to follow, so we will explain them further.

/bin/sh spawns a new shell. A shell is necessary for the piping described above: systemd executes commands directly rather than through a shell, since one is usually not needed, so it does not understand pipes on its own.

The -c parameter is followed by its value (a command enclosed in quotes), which is the script that is run in the new shell:

echo "Server starting on $(hostname)" | /usr/local/bin/slacktee

echo emits a message in which another command ( hostname ) is executed: $(hostname) is replaced with the hostname of the server running the script, so a single unit file can be deployed anywhere without modification. The message has to be in double quotes for this to work, because the shell performs no substitution inside single quotes. The message is then piped to /usr/local/bin/slacktee .
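Quoting decides whether the substitution happens at all. The following sketch contrasts the two quote styles; the variable names literal and expanded are only there for illustration.

```shell
# Inside single quotes the shell takes every character literally,
# so $(hostname) is NOT executed.
literal='Server starting on $(hostname)'

# Inside double quotes $(hostname) is run and replaced by its output.
expanded="Server starting on $(hostname)"

echo "$literal"
echo "$expanded"
```

The first line prints the text $(hostname) unchanged, while the second prints the actual name of the machine. The same rule applies to the string handed to /bin/sh -c in a unit file.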

Note that full paths to scripts are used because the fresh shell is started by systemd with a minimal environment and has no idea about the PATH or any other variable you may have defined in your own session.
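The effect can be approximated outside systemd. The sketch below uses env -i to start a shell with an empty environment, which is an assumption standing in for a systemd-started process (systemd provides a small default environment rather than a completely empty one). DEMO_SETTING is a hypothetical variable name.

```shell
# A variable exported in the current session...
export DEMO_SETTING="from my login shell"   # hypothetical, for illustration only

# ...is visible to a child shell that inherits the environment,
inherited=$(/bin/sh -c 'echo "${DEMO_SETTING:-unset}"')

# ...but not to a shell started with an empty environment (env -i).
clean=$(env -i /bin/sh -c 'echo "${DEMO_SETTING:-unset}"')

echo "inherited: $inherited"
echo "clean:     $clean"
```

Anything a service command relies on therefore has to be spelled out in the unit file itself, which is why absolute paths are the safe choice.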

The code executed in ExecStopPost is a bit more complicated.

echo "Server stopped on $(hostname), last lines from logs:";

This echo is a lot like the previous one, only with a longer message.

/usr/bin/tail -n 22 /var/log/server.log;

tail is a handy tool for getting a number of lines from the end of a file, even if the file is millions of lines long. The -n parameter specifies how many lines to read from that file. Currently we read 22 lines.

Both of these commands are enclosed in curly brackets. The brackets group the two commands so that their output is concatenated and passed on as a single standard output stream.

Note that the semicolons after the commands are important and cannot be omitted when using the curly bracket notation.

The output of the commands in ExecStopPost is piped into /usr/local/bin/slacktee as before.
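The grouping can be tried without systemd or Slack. In this sketch a generated temporary file stands in for /var/log/server.log and cat stands in for /usr/local/bin/slacktee; both are assumptions for illustration, as is reading 3 lines instead of 22.

```shell
# Build a 100-line stand-in for the server log.
log=$(mktemp)
i=1
while [ "$i" -le 100 ]; do
    echo "log line $i"
    i=$((i + 1))
done > "$log"

# The braces group both commands, so a single pipe receives the header
# followed by the last 3 lines of the file.
combined=$({ echo 'last lines from logs:'; tail -n 3 "$log"; } | cat)

echo "$combined"
rm -f "$log"
```

Without the braces only the output of tail would reach the pipe, and the header would be printed to the standard output instead.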

👻

So that's about it. With those two lines (even ExecStopPost alone may be enough) you are actually aware of how your systems are behaving.

EDIT on Mar 16th, 2017: Thanks to the people on Hacker News for pointing out that without prefixing the Pre and Post commands with a dash, our service would not start if the slacktee script failed.