But why are we using Spark Streaming?

Apache Spark is one of the most popular distributed computing frameworks out there, and Spark Streaming is a built-in library in Apache Spark that provides real-time stream processing. Strictly speaking, though, Spark Streaming is a micro-batch oriented stream processing engine. Real-time streaming, right? Well, micro-batches are real time, yes, as real time as we all agree to pretend, anyway. Spark Streaming reads streams from a source; to be more specific, the Spark community calls these sources "receivers", and the received data is processed as a micro-batch job on each iteration. The batch interval must be chosen at some point in the development cycle. There are other alternatives, such as Flink, Storm, Heron, and Samza.
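To make the micro-batch idea concrete, here is a toy simulation in plain Python (this is not the Spark API; the function name and batch logic are illustrative only): records keep arriving, but instead of being handled one by one, they are drained in fixed-size chunks, and each chunk is processed as one small job.

```python
from collections import deque

# Toy simulation of the micro-batch model (NOT the Spark API):
# records arrive continuously, but processing happens in
# fixed-size chunks, one small job per "batch interval".

def micro_batch_process(stream, batch_size):
    """Group the incoming stream into micro-batches and process each one."""
    buffer = deque(stream)  # pretend these records arrived over time
    results = []
    while buffer:
        # drain at most `batch_size` records: one batch interval's worth
        batch = [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]
        # the whole batch is processed as one small job (here: a sum)
        results.append(sum(batch))
    return results

# 10 records with a batch size of 4 become 3 small jobs
print(micro_batch_process(range(10), 4))  # [6, 22, 17]
```

In real Spark Streaming the chunking is driven by a time interval you pass when creating the streaming context, not by a record count, but the processing model is the same: a sequence of small batch jobs rather than true record-at-a-time streaming.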

Okay, but why are we actually using Kafka if we can already use Spark Streaming for the streaming part?

Using a direct stream over a TCP socket may be pointless because it offers no parallelism, while Spark is built around parallel processing, and this is a very good reason to use Kafka. Kafka enables parallel streaming through a feature called "partitions", which maps naturally onto Spark's own partitions. Another reason to use Kafka is that your traffic may be very high, or very bursty. In that case Kafka can help you buffer the traffic without losing data. Sending that traffic directly to Spark Streaming (or another stream processing library) could be brutal, because Spark Streaming has no way to absorb all the incoming traffic through a single pipe.
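Here is a small sketch of why partitions enable that parallelism (again plain Python, not the real Kafka client; the hash function and record data are illustrative only): records with the same key always land in the same partition, so each partition can be consumed by a separate Spark task in parallel while per-key ordering is preserved.

```python
import zlib

# Toy sketch of Kafka-style key partitioning (NOT the real Kafka client).
# Kafka's default partitioner hashes the record key (murmur2 in the real
# broker; crc32 here, just for the sketch) modulo the partition count.

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    """Deterministically map a key to one of the partitions."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# route some illustrative events into their partitions
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in [("user-1", "click"), ("user-2", "view"),
                   ("user-1", "buy"), ("user-3", "click")]:
    partitions[partition_for(key)].append((key, value))

# both "user-1" events ended up in the same partition, in arrival order,
# while other keys can be processed by other consumers in parallel
same = [k for k, _ in partitions[partition_for("user-1")]]
print(same.count("user-1"))  # 2
```

Because each Kafka partition is an independent, ordered log, Spark can assign one task per partition and read them all at once, which is exactly the parallelism a bare TCP socket cannot give you.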