Server Check.in uses a small Node.js application to perform checks on servers and websites (to determine whether and how quickly they're responding to pings and HTTP requests), while the main website and backend are built on Drupal and PHP.

One of my main goals in building Server Check.in with a modular architecture (besides being able to optimize each part using the best technologies available) was to ensure that the infrastructure and backend could easily scale as the service grows.

I'm happy to announce that Server Check.in now has multiple servers performing status checks (there are currently three check servers in the USA; more will be added soon!), and I'd like to explain how the Server Check.in architecture works and why it's built this way.

Priorities, priorities

Some people have asked why Server Check.in only used one server to do status checks initially—especially in light of the fact that most other uptime monitoring services tout tens or hundreds of probe servers as a headline feature. There are two reasons I chose to wait to expand to multiple check servers until recently:

Simplicity and stability trump features.

Now, this is not true for every situation, everywhere, but I've found through the years that people prefer a dependable, reliable, and simple solution (as long as it meets their basic needs) over a flaky solution, no matter how many features it has.

Server Check.in's core product is sending a notification when a server goes down and when it comes back up. If Server Check.in fails to notify you when your server is down, it has failed. And if Server Check.in tells you your server is down when it's not, it has failed even worse, because you start ignoring notifications.

Therefore, for the first few months of Server Check.in's existence, most of the development time was spent refining the accuracy and speed of our server checks, and making sure every notification was delivered successfully. As a direct result of these efforts, check frequencies have been increased and a new premium plan was added with even more value for the price. Server Check.in has also been up more than 99% of the time most months since launch!

Get the architecture right, or die slowly.

A major factor in choosing what new features go into Server Check.in is the limited development time available. To be honest, since Server Check.in is one of Midwestern Mac's side projects (a selfish one at that: I just want to be able to easily and inexpensively monitor my own clients' servers!), there aren't a ton of resources when it comes to developing new features or making large architectural changes.

Therefore, before I embarked on a specific architecture for distributing server checking across many servers, I wanted to make sure my architecture was sound, and would be easy to scale horizontally.

Other projects I've worked on have died a long, slow death because small architectural decisions made at the beginning of the project slowed down or halted development at a later time. Not only did development time have to be reallocated to maintenance and bug fixes, but the developers themselves were demoralized because they didn't get to spend much time creating shiny new features or improving more interesting parts of the system. It's a vicious cycle.

While it does mean features come at a slower pace, keeping a lid on technical debt has allowed me to stay interested in building new features for Server Check.in and making it faster.

New Server Checking Infrastructure

Server Check.in now has multiple Node.js servers running our server checking application, and all of these servers communicate with our central application server to see if there are servers that need to be checked, then post back the check results. All data is transferred via a simple internal RESTful API, and with the current architecture, I'm confident Server Check.in can handle at least a few hundred check servers and thousands more clients without any extra work on the backend.

The internal API communication between the Node.js servers and the main application server is extremely simple, and this is what makes it so powerful. Put another way: by keeping the distributed part of our architecture simple, I avoid making an already complex situation (multiple servers communicating over private LANs or the public Internet) even more complex.
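
To make the poll-and-report cycle described above a little more concrete, here is a minimal sketch. It is not Server Check.in's actual code: the real checker is a Node.js application, and Python is used here only for brevity. The class and function names, the route names mentioned in the comments, and the fake probe are all assumptions made up for illustration, and an in-memory object stands in for the central server so the example runs without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    server_id: int
    up: bool
    response_ms: float


class FakeCentralServer:
    """In-memory stand-in for the central application server's internal API."""

    def __init__(self, pending: Dict[int, str]) -> None:
        self.pending = dict(pending)  # server_id -> URL awaiting a check
        self.results: List[CheckResult] = []

    def get_pending_checks(self) -> Dict[int, str]:
        # Stands in for something like GET /checks/pending (hypothetical route).
        # Handing out the work also clears it, so no check is claimed twice.
        claimed, self.pending = self.pending, {}
        return claimed

    def post_result(self, result: CheckResult) -> None:
        # Stands in for something like POST /checks/results (hypothetical route).
        self.results.append(result)


def run_check_cycle(central: FakeCentralServer, probe: Callable[[str], bool]) -> int:
    """Run one poll -> check -> report cycle and return how many checks ran."""
    checks = central.get_pending_checks()
    for server_id, url in checks.items():
        started = time.perf_counter()
        try:
            up = probe(url)
        except Exception:
            up = False  # any probe error is reported as "down"
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        central.post_result(CheckResult(server_id, up, elapsed_ms))
    return len(checks)


def fake_probe(url: str) -> bool:
    """Hypothetical probe: pretend only URLs containing 'down' are unreachable."""
    return "down" not in url


if __name__ == "__main__":
    central = FakeCentralServer(
        {1: "http://example.com", 2: "http://down.example.com"}
    )
    print(run_check_cycle(central, fake_probe))            # 2
    print([(r.server_id, r.up) for r in central.results])  # [(1, True), (2, False)]
```

Because each check server only ever asks for work and reports results, a design along these lines keeps the check servers stateless: adding another one means starting another copy of the same loop, with no changes on the central side.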

Additionally, our application servers (VPSes hosted in geographically disparate locations with different hosting providers for better redundancy) are using centralized code and configuration management (more on this in a future blog post, I hope!), so boring server management and deployments are trivial.

This architecture will also be very helpful as new types of checks and other new features are added, since everything will be distributed among all our check servers automatically. More time can be spent on developing new features rather than managing infrastructure and architecture.

Please let me know what you think of this post below, or on Hacker News or Reddit.