Hello Folks 🙂

The demand for stream processing is increasing rapidly. The reason is that processing big volumes of data alone is often not enough.

Data has to be processed fast so that a firm can react to changing business conditions in real time.

Stream processing is the continuous, concurrent, real-time processing of data.

For that, I have started learning Apache Spark, as it can process data in batch mode as well as in real time.

Apache Spark is an open-source, general-purpose, lightning-fast cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. Spark can run workloads up to 100 times faster than Hadoop MapReduce when processing data in memory, and about 10 times faster when processing it on disk.

Spark also provides interactive processing, graph processing, in-memory processing, and batch processing of data, all with high speed, ease of use, and a standard interface.

Spark is not only a Big Data processing engine. It is a framework that provides a distributed environment in which to process data, so a wide variety of computational tasks can be expressed on top of it.
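The Spark code later in this post assumes a SparkContext named sc already exists (the Spark shell provides one automatically). In a standalone program it can be created roughly like this; the app name and the local master URL below are just placeholder choices for a single-machine setup:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run Spark locally, using all available cores ("local[*]").
val conf = new SparkConf()
  .setAppName("factorial-demo") // hypothetical app name
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```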

To see its performance, let’s take the example of computing a factorial.

Calculating the factorial of a very large number is cumbersome in any programming language; the CPU takes a long time to complete the calculation.

I have written a factorial function in two ways:

Using tail recursion in Scala

Factorial function using Spark

def factorial(num: BigInt): BigInt = {
  @annotation.tailrec
  def factImp(num: BigInt, fact: BigInt): BigInt = {
    if (num == 0) fact
    else factImp(num - 1, num * fact)
  }
  factImp(num, 1)
}

The time taken by the above code to find the factorial of 200000 on my machine (quad-core Intel i5) was about 20s.
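If you want to reproduce the measurement, a rough wall-clock timing harness can be wrapped around the function. The time helper below is my own sketch, not part of the original code, and a smaller input is used here to keep the run short; for serious benchmarking a tool like JMH is more reliable:

```scala
// Rough wall-clock timer: returns the result and elapsed milliseconds.
def time[A](block: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = block
  (result, (System.nanoTime() - start) / 1000000)
}

// Tail-recursive factorial, as defined above.
def factorial(num: BigInt): BigInt = {
  @annotation.tailrec
  def factImp(num: BigInt, fact: BigInt): BigInt =
    if (num == 0) fact else factImp(num - 1, num * fact)
  factImp(num, 1)
}

val (result, ms) = time(factorial(BigInt(20000)))
println(s"factorial(20000) computed in ${ms}ms")
```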

def factorialUsingSpark(num: BigInt): BigInt = {
  if (num == 0) BigInt(1)
  else {
    val list = (BigInt(1) to num).toList
    // sc is the SparkContext; parallelize distributes the list
    // across partitions before the multiplication is reduced.
    sc.parallelize(list).reduce(_ * _)
  }
}

The time taken by Spark to find the factorial of 200000 on the same machine was only about 5s, roughly 4x faster than using Scala alone.
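The speed-up is possible because multiplication is associative: Spark’s reduce multiplies the numbers in each partition independently, in parallel across cores, and then combines the partial products. The idea can be sketched in plain Scala without Spark; the function name and the chunking scheme below are just an illustration of mine, not Spark’s actual implementation:

```scala
// Sketch of a partitioned product: split the range into chunks,
// compute each chunk's product independently (the step Spark runs
// in parallel), then combine the partial products.
def factorialChunked(num: Int, numChunks: Int): BigInt = {
  val chunkSize = math.max(1, num / numChunks)
  (1 to num)
    .grouped(chunkSize)                // "partitions"
    .map(_.map(BigInt(_)).product)     // per-partition product
    .foldLeft(BigInt(1))(_ * _)        // combine partial products
}
```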

Computation time does depend on the hardware of the system, but at least this gives us an idea of how efficiently Spark processes complex computations.

So, this was my first step in learning Spark with Scala. I know that it is not much, but I still need to explore more of Spark, like RDDs, DataFrames, Structured Streaming, etc., which I will be writing about in future posts. So, stay tuned 🙂

The complete code can be downloaded from GitHub.

Comments and suggestions are welcome.