Enter Apache Flink

Apache Flink is built on the idea that it should not be hard to express simple computations (like AVG coupled with GROUP BY in this case) while still being able to scale indefinitely, in a fault-tolerant manner. Most of the problems described above have little to do with the nature of the actual computation you want to do. For those familiar with MapReduce, this is MapReduce for streaming data.

In our case, all we need to tell Flink is to partition the input stream by the stock's industry and apply a moving average over a 1-minute window.
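To make the semantics concrete, here is a minimal plain-Python sketch (not the Flink API) of what that computation does: trades are keyed by industry, bucketed into 1-minute tumbling windows, and averaged per (industry, window). The tuple layout is a hypothetical trade record chosen for illustration.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute window

def windowed_average(trades):
    """trades: iterable of (timestamp_ms, industry, price).

    Returns the average price per (industry, window_start) pair.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (industry, window) -> [sum, count]
    for ts, industry, price in trades:
        window_start = ts - ts % WINDOW_MS  # assign the trade to its window
        entry = sums[(industry, window_start)]
        entry[0] += price
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

In Flink the same intent is expressed declaratively (key the stream, define the window, supply the aggregate), and the runtime handles distribution and fault tolerance.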

You can specify how much parallelism you want for each of these subtasks. You could have, say, two readers reading from the source and four time-window operators applying the moving average. Each time-window instance will then be responsible for only 25% of the keys (industry names of the stocks). Flink will ensure that both readers send trades of stocks belonging to the same industry to the correct time-window operator.
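The routing guarantee rests on deterministic key partitioning: every reader hashes the key the same way, so the same industry always lands on the same operator instance. The sketch below illustrates the idea; Flink actually uses its own key-group hashing internally, and the function name is hypothetical.

```python
import hashlib

NUM_WINDOW_OPERATORS = 4  # parallelism of the time-window stage

def operator_for(industry: str) -> int:
    """Map a key to an operator instance with a stable hash, so every
    reader independently routes the same industry to the same operator."""
    digest = hashlib.md5(industry.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_WINDOW_OPERATORS
```

Because the hash is a pure function of the key, no coordination between readers is needed to keep the partitioning consistent.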

If you now want to scale up, you can simply increase the parallelism of the readers if you need to absorb data at a higher rate. If the number of industries has increased, you can simply increase the number of time-window operators. It is that simple.

Source: Apache Flink

As a programmer, you only have to worry about your computational logic; the framework converts it into a parallel execution plan and handles the routing and other complexities that come with it. The easiest way to get started is by getting your hands dirty here.

Why we chose Flink at ArchSaber

Our key requirements for which Flink was a perfect choice were as follows:

Write Once, Read Many

We collect on the order of thousands of distinct streams of data from a typical customer. This data enters our system once and is read many times, since it powers our dashboards. To optimize the read use case, we wanted to pre-aggregate this data across several useful dimensions before persisting it. Flink makes it easy to express quite complex aggregates in a scalable and fault-tolerant manner.
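The pre-aggregation idea can be sketched as follows. This is an illustrative Python simulation, not ArchSaber's actual pipeline; the dimension names (`host`, `service`) are hypothetical. Each incoming point updates one small rollup per dimension, so dashboard reads hit the pre-computed rollups instead of scanning raw data.

```python
from collections import defaultdict

def preaggregate(points, dims=("host", "service")):
    """points: iterable of (tags: dict, value).

    Maintains one running sum per (dimension, tag value), i.e. a
    write-time rollup for each dimension of interest.
    """
    rollups = defaultdict(float)  # (dimension, tag value) -> running sum
    for tags, value in points:
        for dim in dims:
            if dim in tags:
                rollups[(dim, tags[dim])] += value
    return dict(rollups)
```

In a streaming setting, the same shape of computation is a keyed windowed aggregate per dimension, which is exactly what Flink lets you express directly.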

Low latency

The latency between when an anomalous data point reaches ArchSaber and when we generate an alert on it is key. This ruled out technologies like Spark, which natively views data as a batch (or a micro-batch for streaming). We also considered Apache Storm, but its poor documentation and community support raised a red flag.