Apache Spark is widely regarded as a powerful complement to Hadoop, the original big data technology. It is a more accessible and capable tool for tackling a wide range of big data challenges, and it has become one of the most in-demand big data frameworks across major industries. Since Hadoop 2.0 introduced YARN, Spark can run on Hadoop clusters alongside other workloads, and it is one of the most useful technologies for Python big data engineers.

This series of posts is a one-stop resource that gives an overview of the Spark architecture. It is aimed at people who want to learn Spark.


Apache Spark architecture is based on two main abstractions:

Resilient Distributed Dataset (RDD)

Directed Acyclic Graph (DAG)

Let's dive into these concepts.

RDD: the basic Spark concept

The key to understanding Apache Spark is the RDD, or Resilient Distributed Dataset. An RDD contains an arbitrary collection of objects. The data in an RDD is logically split into partitions that are distributed across the cluster nodes so that they can be processed in parallel.

Physically, an RDD is stored as an object in the driver JVM and refers to data stored in permanent storage (HDFS, Cassandra, HBase, etc.), in a cache (memory, memory and disk, disk only, etc.), or in another RDD.

An RDD stores the following metadata:

Partitions: a set of data splits associated with this RDD, located on the cluster nodes. One partition is the smallest batch of data that a cluster node can process independently;

Dependencies: a list of parent RDDs involved in the computation, the so-called lineage graph;

Computation: a function that computes the child RDD from its parent RDDs;

Preferred Locations: where it is best to run the computation for each partition (data locality);

Partitioner: how the data is divided into partitions (for key-value RDDs, HashPartitioner is the usual default).

An RDD can be recomputed at any time, just like the data it refers to, because each RDD knows how it was created (it stores its lineage graph). In addition, an RDD can be materialized in memory or on disk.
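To make the idea of lineage more concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's implementation: the class `ToyRDD` and its attributes are hypothetical names invented for this illustration. Each object remembers its parent and the function that derives its data, so a lost result can be rebuilt by walking back to the source.

```python
# Toy model of lineage-based recomputation; NOT Spark's implementation.
# ToyRDD and all of its attributes are hypothetical names.
class ToyRDD:
    def __init__(self, parent=None, compute=None, source=None):
        self.parent = parent      # dependency: the parent ToyRDD, if any
        self.compute = compute    # function that derives this data from the parent's data
        self.source = source      # original data, only set for a root ToyRDD
        self.materialized = None  # materialized result; may be "lost"

    def map(self, fn):
        # Nothing is computed here; we only record how to compute the child.
        return ToyRDD(parent=self, compute=lambda data: [fn(x) for x in data])

    def lineage(self):
        # Walk the chain of parents back to the root.
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def collect(self):
        if self.materialized is None:
            if self.parent is None:
                self.materialized = list(self.source)
            else:
                self.materialized = self.compute(self.parent.collect())
        return self.materialized


root = ToyRDD(source=range(5))
doubled = root.map(lambda x: x * 2)
first = doubled.collect()

# Simulate losing the materialized data, then rebuild it from the lineage.
doubled.materialized = None
root.materialized = None
recovered = doubled.collect()

print(first, recovered, len(doubled.lineage()))
```

The recovered result is identical to the first one because the recipe, not the data, is what the object really owns.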

Example:

>>> rdd = sc.parallelize(range(20))  # create RDD
>>> rdd
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195
>>> rdd.collect()  # collect data on driver and show
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

An RDD can also be cached and manually partitioned. Caching pays off when we use a particular RDD several times; without it, the whole lineage graph is recomputed on every use, which slows down our calculations. Manual partitioning is important for balancing data properly between partitions. As a rule, smaller partitions allow us to distribute data more evenly among more executors, so they can speed up jobs that repartition data often (data reorganization during computation).
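The effect of caching can be sketched without a cluster. The snippet below is a plain-Python illustration with hypothetical names (`expensive_square`, `compute`); it only counts how often the "lineage" is executed. In PySpark the corresponding calls are `rdd.cache()` and `rdd.persist()`.

```python
# Toy illustration of why caching helps; all names here are hypothetical.
calls = {"count": 0}

def expensive_square(x):
    calls["count"] += 1  # count how often the computation really runs
    return x * x

data = list(range(5))

def compute():
    # Stands in for re-running the whole lineage graph.
    return [expensive_square(x) for x in data]

# Without caching: every "action" recomputes everything.
total = sum(compute())
biggest = max(compute())
calls_without_cache = calls["count"]

# With caching: compute once and reuse the materialized result.
calls["count"] = 0
cached = compute()
total_cached = sum(cached)
biggest_cached = max(cached)
calls_with_cache = calls["count"]

print(calls_without_cache, calls_with_cache)
```

Two uses without caching cost twice as many function calls as two uses of the cached result, and the gap grows with every additional use.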

Let's check the number of partitions and data on them:

>>> rdd.getNumPartitions()  # get current number of partitions
4
>>> rdd.glom().collect()  # collect data on driver based on partitions
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
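The even split in the output above can be reproduced with simple position-based slicing. The function `split_evenly` below is a hypothetical helper written for this post; it is an assumption about how such a split could be done and is not taken from Spark's source code.

```python
def split_evenly(items, num_partitions):
    """Slice a sequence into num_partitions contiguous chunks by position."""
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

partitions = split_evenly(range(20), 4)
print(partitions)

# When the size is not divisible, the chunks differ by at most one element.
uneven_sizes = [len(p) for p in split_evenly(range(10), 4)]
print(uneven_sizes)
```

With 20 elements and 4 partitions every chunk holds exactly 5 elements, which matches the result of glom() shown above.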

All the interesting things that happen in Spark happen through RDD operations. A typical Spark application looks like this: we create an RDD (for example, by reading a file from HDFS), transform it (map, filter, join, groupBy, reduceByKey, ...), and do something with the result (for example, write it back to HDFS).

Two types of operations can be performed on an RDD (and, accordingly, all work with data is a sequence of these two types): transformations and actions.

Transformations

The result of applying a transformation to an RDD is a new RDD. As a rule, these are operations that somehow transform the elements of the given data.

Transformations are lazy by nature: when we call a transformation on an RDD, it is not executed immediately. Spark only keeps a record of which operations were called (via the DAG, which we will discuss later).

We can think of a Spark RDD as a distributed data set to which transformations are applied. Because transformations are lazy, execution starts only when we trigger an action on the RDD. Consequently, data is not loaded until it is needed, that is, until an action is triggered. This opens up many possibilities for low-level optimizations.
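Laziness is easy to imitate in plain Python. The class `LazyPipeline` below is a hypothetical toy, not part of Spark: its `map` and `filter` only record what should be done, and the source is read only when the `collect` action is called.

```python
# Toy model of lazy transformations; LazyPipeline is a hypothetical class.
class LazyPipeline:
    def __init__(self, source, ops=()):
        self.source = source   # a function that loads the data when called
        self.ops = tuple(ops)  # recorded transformations, not yet executed

    def map(self, fn):
        return LazyPipeline(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self.source, self.ops + (("filter", pred),))

    def collect(self):
        # The action: only now is the source read and the recorded ops applied.
        out = []
        for item in self.source():
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # kind == "filter"
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


reads = []

def load():
    reads.append("loaded")  # track when the data is actually read
    return range(20)

pipeline = LazyPipeline(load).filter(lambda x: x > 10).map(lambda x: x * 2)
reads_before_action = len(reads)
result = pipeline.collect()
reads_after_action = len(reads)

print(reads_before_action, reads_after_action, result)
```

Building the pipeline does not touch the data at all; the single read happens inside `collect`.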

At a high level, two groups of transformations can be applied to an RDD: narrow transformations and wide transformations.

A narrow transformation does not require a shuffle, that is, a reorganization of data between partitions. Examples are map, filter, etc. Narrow transformations are pipelined together, so a chain of them can be executed in parallel on different partitions.

A shuffle occurs when data is regrouped between partitions. This is necessary when a transformation requires information from other partitions, such as summing up all values in a column. Spark collects the necessary data from each partition and consolidates it into a new partition, most likely on another executor.

There are exceptions, however. Operations like coalesce may cause a task to work with multiple input partitions, but the transformation is still considered narrow because the input records used to compute any output record can be found in only a limited subset of partitions.
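The coalesce case can be sketched as follows. The helper `toy_coalesce` is a hypothetical name, and grouping adjacent partitions is an assumption made for simplicity; real Spark may choose the groups differently, for example by taking data locality into account. The point is that each output partition reads from a fixed, small group of parents, so no record has to be routed by key.

```python
def toy_coalesce(partitions, num_output):
    """Merge adjacent parent partitions into num_output partitions (toy model)."""
    n = len(partitions)
    # Each output partition depends on a fixed group of parent indices.
    groups = [
        list(range(i * n // num_output, (i + 1) * n // num_output))
        for i in range(num_output)
    ]
    merged = [
        [x for parent in group for x in partitions[parent]]
        for group in groups
    ]
    return groups, merged

parents = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
groups, merged = toy_coalesce(parents, 2)
print(groups)
print(merged)
```

Every parent partition feeds exactly one output partition, which is what keeps the dependency narrow.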

Let's apply a filter transformation to our data:

>>> filteredRDD = rdd.filter(lambda x: x > 10)
>>> print(filteredRDD.toDebugString())  # to see the execution graph; only one stage is created
(4) PythonRDD[1] at RDD at PythonRDD.scala:53 []
 |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
>>> filteredRDD.collect()
[11, 12, 13, 14, 15, 16, 17, 18, 19]

In this example we don't need to shuffle the data; each partition can be processed independently.

However, Spark also supports wide dependencies (that is, wide transformations) such as groupByKey, reduceByKey, etc. With these dependencies, the data needed for processing can be located in several partitions of the parent RDD and has to be combined. To implement these operations, Spark must perform a shuffle, moving data across the cluster and forming a new stage with a new set of partitions, as in the example below:

>>> groupedRDD = filteredRDD.groupBy(lambda x: x % 2)  # group data based on mod
>>> print(groupedRDD.toDebugString())  # two separate stages are created, because of the shuffle
(4) PythonRDD[6] at RDD at PythonRDD.scala:53 []
 |  MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:133 []
 |  ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(4) PairwiseRDD[3] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
    |  PythonRDD[2] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
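What the shuffle does to the data can be imitated in plain Python. The function `toy_shuffle_group_by` is a hypothetical helper, and using Python's built-in `hash` modulo the number of partitions is an assumption for illustration; Spark uses its own hash function and moves the data over the network. The input is the filtered data laid out in the four partitions from the earlier example.

```python
from collections import defaultdict

def toy_shuffle_group_by(partitions, key_fn, num_output):
    """Route every record to an output partition chosen by its key (toy model)."""
    output = [defaultdict(list) for _ in range(num_output)]
    for part in partitions:           # every input partition contributes
        for record in part:
            key = key_fn(record)
            target = hash(key) % num_output  # records with equal keys meet here
            output[target][key].append(record)
    return [dict(groups) for groups in output]

# Filtered data (x > 10), still sitting in the original four partitions.
filtered = [[], [], [11, 12, 13, 14], [15, 16, 17, 18, 19]]
grouped = toy_shuffle_group_by(filtered, lambda x: x % 2, 4)
print(grouped)
```

Records from two different input partitions end up in the same output partition, which is exactly why a wide transformation cannot be computed partition by partition.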

Actions

Actions are applied when it is necessary to materialize the result: save the data to disk, write the data to a database, or output part of the data to the console. The collect operation that we have used so far is also an action: it returns the data to the driver.

Actions are not lazy: they actually trigger the data processing. Actions are RDD operations that produce values that are not RDDs.

For example, to get the sum of our filtered data, we can use the reduce operation:

>>> filteredRDD.reduce(lambda a, b: a + b)
135
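Conceptually, a reduce over partitioned data can be done in two steps: reduce each partition on its own, then combine the partial results. The sketch below is a pure-Python toy with a hypothetical helper `toy_reduce`, not Spark's code, and it assumes the function is commutative and associative, as addition is.

```python
from functools import reduce

def toy_reduce(partitions, fn):
    """Reduce each non-empty partition, then combine the partial results."""
    # Step 1: independent per-partition reduction (parallel on a real cluster).
    partials = [reduce(fn, part) for part in partitions if part]
    # Step 2: combine the partial results into a single value.
    return partials, reduce(fn, partials)

filtered = [[], [], [11, 12, 13, 14], [15, 16, 17, 18, 19]]
partials, total = toy_reduce(filtered, lambda a, b: a + b)
print(partials, total)
```

The two non-empty partitions produce the partial sums 50 and 85, which combine to the same 135 as above.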

DAG

Unlike Hadoop MapReduce, where the user has to break down all the operations into smaller tasks and chain them together, Spark defines tasks that can be computed in parallel on the partitioned data in the cluster. From these tasks, Spark builds a logical flow of operations that can be represented as a directed acyclic graph (DAG), where a node represents an RDD partition and an edge represents a data transformation. Spark builds the execution plan implicitly from the application provided by the user.

DAGScheduler computes a DAG of stages for each job. A stage consists of tasks based on input data partitions. DAGScheduler pipelines some transformations together; for example, many map operators can be combined into one stage. The end result of DAGScheduler is an optimal set of stages in the form of TaskSets, which are then passed to TaskScheduler. The number of tasks in a stage depends on the number of partitions. TaskScheduler launches the tasks through the cluster manager and knows nothing about the dependencies between stages.
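The way stages are cut at shuffle boundaries can be sketched with a few lines of Python. This is a deliberately simplified toy: `split_into_stages` is a hypothetical helper, the list of operations is made up for the example, and real Spark runs the map-side part of a shuffle in the earlier stage, which this model ignores.

```python
def split_into_stages(ops):
    """ops: list of (name, is_wide) pairs in execution order (toy model)."""
    stages, current = [], []
    for name, is_wide in ops:
        if is_wide and current:
            stages.append(current)  # a shuffle closes the current stage
            current = []
        current.append(name)        # narrow ops are pipelined into the stage
    if current:
        stages.append(current)
    return stages

# Hypothetical job: two narrow ops, one wide op, one more narrow op.
ops = [
    ("parallelize", False),
    ("filter", False),
    ("groupBy", True),
    ("mapValues", False),
]
stages = split_into_stages(ops)

num_partitions = 4  # assumed to be the same for every stage in this toy
total_tasks = len(stages) * num_partitions

print(stages, total_tasks)
```

Each stage yields one task per partition, so two stages over four partitions give eight tasks in this simplified model.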

RDDs can determine preferred locations for processing partitions. DAGScheduler places computation so that it will be as close to the data as possible (data locality).
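A locality-aware placement decision can be illustrated roughly like this. The function `place_task` and the host names are hypothetical, and the rule (take a preferred host if it is free, otherwise any free host) is an assumption that ignores details of the real scheduler such as waiting for a local slot.

```python
def place_task(preferred_hosts, free_hosts):
    """Pick a host for a task, favoring hosts that already hold the data."""
    for host in preferred_hosts:
        if host in free_hosts:
            return host, "local"
    # No preferred host is free: fall back to any free host (data must be fetched).
    return sorted(free_hosts)[0], "remote"

free = {"node-a", "node-b"}
local_choice = place_task(["node-b"], free)
remote_choice = place_task(["node-c"], free)

print(local_choice, remote_choice)
```

When the partition's preferred host is free, the task runs next to its data; otherwise it runs elsewhere and the data has to travel.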
