Apache Flink is a popular framework for real-time data processing. It keeps gaining popularity thanks to its low-latency, fault-tolerant processing at extremely high throughputs.

I have been using Apache Flink in production for the last three years, and it has excelled at every workload thrown at it. I have run Flink jobs handling data streams at more than 10 million records per minute with no more than 20 cores. And it's not just my experience: benchmarks published by other companies show similar results.

You can find the official benchmarks here.

So the natural question is: how does Flink manage to scale so efficiently?

Here are some of the neat tricks.

Reduce Garbage Collection

When you operate on large amounts of data in Java, garbage collection can quickly become a bottleneck: a full GC can stall the JVM for seconds, and in extreme cases minutes.

Flink addresses this by managing memory itself. It reserves a large fraction of the heap (typically around 70%) as Managed Memory and fills it with memory segments of equal size (32 KB by default). These segments are similar to java.nio.ByteBuffer and wrap plain byte arrays.

Whenever an operator needs memory, it requests segments from the memory manager and returns them to the pool when done. Because these segments are long-lived and continuously reused, they settle in the old generation of the heap and rarely participate in GC cycles.
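The request-and-return cycle can be sketched as a simple segment pool. This is a minimal illustration, not Flink's actual MemoryManager; the class and method names here are hypothetical. The key idea is that the byte arrays are allocated once up front and reused forever, so they stay in the old generation and never churn through young-generation GC.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of a Flink-style segment pool.
// All segments are allocated once; operators borrow and return them,
// so no per-record garbage is created during normal operation.
public class SegmentPool {
    public static final int SEGMENT_SIZE = 32 * 1024; // 32 KB, Flink's default

    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    public SegmentPool(int numSegments) {
        for (int i = 0; i < numSegments; i++) {
            free.push(new byte[SEGMENT_SIZE]); // allocated once, reused forever
        }
    }

    // An operator requests a segment instead of allocating its own buffer.
    public byte[] request() {
        byte[] segment = free.poll();
        if (segment == null) {
            throw new IllegalStateException("pool exhausted");
        }
        return segment;
    }

    // When the operator is done, the segment goes back to the pool.
    public void release(byte[] segment) {
        free.push(segment);
    }

    public int available() {
        return free.size();
    }
}
```

Compare this with naively allocating a fresh buffer per record: the pool's arrays survive a handful of GC cycles, get promoted to the old generation, and then cost the collector almost nothing.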

Flink can also allocate memory segments off-heap, which enables faster I/O to the network and file system, especially for stateful operators.
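In plain Java, off-heap allocation is available through direct ByteBuffers. The sketch below (hypothetical class name) shows the idea: ByteBuffer.allocateDirect reserves memory outside the Java heap, so the GC never scans its contents and the JVM can pass it to network or file I/O without an extra copy.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of an off-heap segment backed by a direct ByteBuffer.
// The buffer's storage lives outside the Java heap, invisible to the GC.
public class OffHeapSegment {
    private final ByteBuffer buffer;

    public OffHeapSegment(int size) {
        this.buffer = ByteBuffer.allocateDirect(size); // off-heap allocation
    }

    public void putInt(int index, int value) {
        buffer.putInt(index, value);
    }

    public int getInt(int index) {
        return buffer.getInt(index);
    }

    // Direct buffers report isDirect() == true.
    public boolean isOffHeap() {
        return buffer.isDirect();
    }
}
```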

Another advantage of Managed Memory is that Flink can spill larger segments to disk and read them back later. This spilling helps prevent OutOfMemory errors.
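The spilling mechanism can be sketched as follows. This is a simplified illustration with a hypothetical class name, not Flink's actual spilling code: when memory runs low, a segment's bytes are written to a temporary file so the segment can be returned to the pool, and the bytes are read back on demand.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of spilling a segment to disk and reading it back.
public class Spiller {
    private final Path file;

    public Spiller() throws IOException {
        // One temp file per spilled segment, for simplicity.
        this.file = Files.createTempFile("spill-segment", ".bin");
    }

    // Persist the segment's bytes so its memory can be reused.
    public void spill(byte[] segment) throws IOException {
        Files.write(file, segment);
    }

    // Restore the bytes when the operator needs the segment again.
    public byte[] readBack() throws IOException {
        return Files.readAllBytes(file);
    }
}
```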