We used GitHub's Brubeck as our front-end UDP collecting solution for a long time (see our previous posts: 1, 2). It is written in C, its code is very simple (which matters when you want to contribute), and, most importantly, it was able to handle our huge load of 2m metrics per second (MPS) at peak. Brubeck's documentation claims support for 4m MPS, with a caveat: reaching such a value requires tuning the Linux network stack. Despite all these advantages, the solution has some serious shortcomings.

Shortcoming #1. The developer, GitHub, stopped supporting the project after publishing it. They didn't respond to issues and didn't merge PRs, including ours. At the time of writing, the project has become a bit more alive again (starting from around Feb 2018), but before that it went through almost two years of total inactivity. We also found a comment on one of the PRs in which one of the authors says they only accept PRs that are useful for GitHub. That could become a serious blocker for us in the future.

Shortcoming #2. The precision of the data. As you can see from the code, brubeck samples only 65536 values per metric. In our case, we get far more than that during an aggregation period (1,527,392 is the most we have seen at peak). As a result of such sampling, the “max” and “min” aggregates become useless:

(what it looks like)

(what it should look like)

For the same reason, the sum aggregation is incorrect as well. Add to this the 32-bit float overflow bug (still not fixed, by the way), which leads to a server segfault.
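To make the precision problem concrete, here is a minimal sketch in Rust. It is not brubeck's actual sampling algorithm: `aggregate_capped` is a hypothetical collector that simply keeps the first `cap` values of a period, which is an assumption made purely for illustration. It is compared with exact aggregates that need only constant memory per metric.

```rust
/// Exact aggregates for one metric; they need O(1) memory,
/// no matter how many values arrive during the period.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Aggregates {
    count: u64,
    sum: f64,
    min: f64,
    max: f64,
}

/// Aggregates every value it is given. Returns None for empty input.
fn aggregate(values: &[f64]) -> Option<Aggregates> {
    let first = *values.first()?;
    let mut agg = Aggregates { count: 0, sum: 0.0, min: first, max: first };
    for &v in values {
        agg.count += 1;
        agg.sum += v;
        agg.min = agg.min.min(v);
        agg.max = agg.max.max(v);
    }
    Some(agg)
}

/// HYPOTHETICAL capped collector (an assumption, not brubeck's code):
/// only the first `cap` values of the period are kept, the rest are dropped.
fn aggregate_capped(values: &[f64], cap: usize) -> Option<Aggregates> {
    aggregate(&values[..values.len().min(cap)])
}
```

If a period contains the values 1, 2, ..., 100000 and the cap is 65536, the capped collector never sees the real maximum, and its count and sum are too low as well. Any scheme that drops values can lose the extremes in the same way.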

And finally, Issue #X: the one we are ready to pin on all 14 statsd implementations we were able to find at the time of writing. Imagine an infrastructure that has grown so big that the number of metrics collected there significantly exceeds the theoretical limit of 4m MPS. Or imagine it hasn't grown that much yet, but the collected metrics have already become important to some big bosses. Imagine a level of importance at which a 2–3 minute absence of data is critical enough for managers to fall into deep depression. Obviously, treating depression is not what techs do best, so we prefer to find a technical solution to this problem. The solutions are well known and obvious:

First, fault tolerance: we don't want a single failure on one of the servers to create a psychiatric zombie apocalypse at the office.

Second, scalability: we need a way to receive more than 4m MPS without digging deep into the network stack, while being able to grow vertically to the necessary size.

Since we knew we already had some scalability headroom, we decided to start with fault tolerance. “Hey, it's just fault tolerance, we've done this hundreds of times before, it's simple”, we thought. So we took two copies of brubeck and ran them in parallel. To do that, we wrote a small UDP traffic multiplexing utility. The fault tolerance problem seemed to be solved, but... ehmm... not very well. At first everything seemed fine: every brubeck instance computed its own version of the aggregates and sent it to Graphite every 30 seconds, overwriting the previous interval (Graphite is able to do things like that). If one server failed, we had another one holding a full copy of the aggregated data. But there was a problem: the 30-second aggregation periods were not synchronised between the nodes, so every time a server had a problem or was just stopped for maintenance, you could see a “saw” on the graphs, because one of the last aggregations was lost and didn't get overwritten. The same thing happened when the stopped server was relaunched.
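The multiplexing utility itself is not shown here, so the following is only a minimal sketch of what UDP fan-out can look like in Rust, using nothing but the standard library. The function names `fan_out_once` and `run_multiplexer` and the buffer size are illustrative assumptions, not the code we actually ran.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

/// Receives one datagram from `input` and forwards an identical copy
/// to every address in `targets`. Returns the payload size in bytes.
fn fan_out_once(input: &UdpSocket, targets: &[SocketAddr]) -> io::Result<usize> {
    // 65535 bytes is enough for any UDP payload.
    let mut buf = [0u8; 65_535];
    let (len, _sender) = input.recv_from(&mut buf)?;
    for target in targets {
        input.send_to(&buf[..len], target)?;
    }
    Ok(len)
}

/// Binds to `listen` and copies incoming datagrams forever;
/// it returns only if a socket operation fails.
fn run_multiplexer(listen: &str, targets: &[SocketAddr]) -> io::Result<()> {
    let input = UdpSocket::bind(listen)?;
    loop {
        fan_out_once(&input, targets)?;
    }
}
```

A single-threaded loop like this only shows the idea. It does nothing to align the aggregation periods of the receiving servers, which is exactly the weakness described above.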

Also, this scheme did not solve the scalability problem, because each server still received the full 2–4m MPS and we still could not increase that number. If you try to find a solution while digging some snow (we have snowy winters in Russia!), you may come to the same idea we did: we need a statsd that can work in distributed mode. This means the nodes have to keep their metrics in sync, taking timestamps into account. “Of course there is such a solution”, we said and went googling... and found nothing. Digging through the documentation of many implementations (as of 11.12.2017), we found literally nothing! It seems other developers and sysadmins either have not met this problem yet or have said nothing about their solutions.
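One way to picture "keeping metrics in sync considering their timestamp" is the sketch below. It is an illustration under our own assumptions, not a description of any existing protocol: the types `Partial` and `Node`, the function `bucket_start`, and the idea of keying partial aggregates by metric name and interval start are all hypothetical.

```rust
use std::collections::HashMap;

/// Key: (metric name, start of the aggregation interval as a Unix timestamp).
type Key = (String, u64);

/// A partial aggregate that can be merged with partials from other nodes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partial {
    count: u64,
    sum: f64,
    min: f64,
    max: f64,
}

impl Partial {
    fn from_value(v: f64) -> Self {
        Partial { count: 1, sum: v, min: v, max: v }
    }

    /// Folds another partial for the same key into this one.
    fn merge(&mut self, other: &Partial) {
        self.count += other.count;
        self.sum += other.sum;
        self.min = self.min.min(other.min);
        self.max = self.max.max(other.max);
    }
}

/// Aligns a timestamp to the start of its interval, so that all nodes
/// agree on interval boundaries regardless of when they were started.
fn bucket_start(timestamp: u64, interval: u64) -> u64 {
    timestamp - timestamp % interval
}

/// HYPOTHETICAL node state: partial aggregates per metric and interval.
#[derive(Default)]
struct Node {
    partials: HashMap<Key, Partial>,
}

impl Node {
    /// Records one incoming value into the interval its timestamp belongs to.
    fn record(&mut self, name: &str, timestamp: u64, value: f64, interval: u64) {
        let key = (name.to_string(), bucket_start(timestamp, interval));
        let incoming = Partial::from_value(value);
        self.partials
            .entry(key)
            .and_modify(|p| p.merge(&incoming))
            .or_insert(incoming);
    }

    /// Merges everything a peer has collected into this node's state.
    fn absorb(&mut self, peer: &Node) {
        for (key, partial) in &peer.partials {
            self.partials
                .entry(key.clone())
                .and_modify(|p| p.merge(partial))
                .or_insert(*partial);
        }
    }
}
```

Because the interval start is derived from the timestamp rather than from each node's own clock tick, two nodes that receive different parts of the traffic can still produce one consistent aggregate per interval after merging.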

And then we recalled that we had written a toy statsd implementation for a hackathon project. We called it Bioyino, which means nothing: it's the name the hackathon script generated semi-randomly for the project. And then it crossed our minds that we needed our own statsd implementation. Here are the arguments for it:

because there are too few statsd clones in the world,

because we can get the level of fault tolerance and scalability that we really need and solve the problems mentioned above, at least metric sync and metric sending conflicts,

because we can ensure better precision than brubeck gives us,

because we can add more useful statistics about incoming and outgoing metrics, which were almost absent in brubeck,

because we got a chance to program our own high-performance distributed scalable application, which is not a clone of another high perfor… you get it.

Which language to choose? Rust, of course! But why?

because there was a prototype,

because the author of this article already knew Rust and was nuts about writing an open-source solution in it,

because we cannot afford GC languages that affect the almost real-time traffic of endlessly incoming metrics,

because we needed top performance comparable to C,

because Rust gives us fearless concurrency without overhead, and if we had written this in C/C++ (which we definitely know worse), we would surely have ended up with more security issues, race conditions, etc.

There was one argument against Rust: our company has no experience with Rust projects, and we are not going to use it in our main project in the future. So there were serious concerns, but we took the risk and tried it.