To understand how Spark runs, it is important to know its architecture. The following discussion gives you a clearer view of it.

There are three ways Apache Spark can be deployed (a short configuration sketch follows this list):

Standalone – Spark runs on resources that are statically allocated across the cluster's machines, and it can run alongside MapReduce on an existing Hadoop cluster. This is the simplest deployment.

On Hadoop YARN – Spark runs on top of YARN with no pre-installation required on the cluster. This deployment makes the fullest use of Spark together with the rest of the Hadoop ecosystem.

Spark In MapReduce (SIMR) – If YARN is not available, Spark can also be launched inside MapReduce. This keeps the deployment burden low.
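From the application's point of view, the main visible difference between these deployments is the master URL it connects to. Below is a minimal sketch in Scala; the object name, application name, hostname, and port are placeholders, and in practice the master is usually supplied with spark-submit's --master flag rather than hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only: hostnames and ports below are placeholders.
object MasterUrlExamples {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("deployment-example")
      // Standalone cluster: point at the Spark master's URL.
      .setMaster("spark://master-host:7077")
      // On YARN the master is simply "yarn"; the ResourceManager address
      // comes from the Hadoop configuration on the classpath.
      // .setMaster("yarn")
      // For local testing, run everything in a single JVM.
      // .setMaster("local[*]")

    val sc = new SparkContext(conf)
    println(s"Running against master: ${sc.master}")
    sc.stop()
  }
}
```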

However Spark is deployed, the cluster manager allocates resources to it. As soon as the application connects, Spark acquires executors on the worker nodes. These executors are the processes that run computations and store data for the application. The application code is then sent to the executors, and the SparkContext sends tasks to the executors to run.
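The following sketch shows that flow in a tiny driver program. It is illustrative only: the object name and application name are made up, and the local master URL stands in for whatever cluster manager is actually used.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[2]")
    // Creating the SparkContext connects to the cluster manager, which
    // allocates executors for this application.
    val sc = new SparkContext(conf)

    // The dataset is split into partitions; each partition becomes a task.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)

    // The action below makes the SparkContext ship tasks (one per partition)
    // to the executors; they compute partial results and return them.
    val total = numbers.reduce(_ + _)
    println(s"Sum = $total")

    sc.stop()
  }
}
```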

Some important terms that describe the architecture are:

Spark application: A user program built on Apache Spark.
Application jar: A jar containing the user's Spark application and its dependencies.
Driver program: The process that runs the main() function of the application and creates the SparkContext.
Cluster manager: An external service that acquires resources on the cluster (for example, the standalone manager or YARN).
Deploy mode: Determines where the driver process runs, inside the cluster ("cluster" mode) or on the submitting machine ("client" mode).
Worker node: Any node in the cluster that can run application code.
Executor: A process launched for an application on a worker node; it runs tasks and stores data for that application.
Job: A parallel computation consisting of multiple tasks, launched in response to a Spark action.
Task: A unit of work sent by the SparkContext to one executor.
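To see how these terms map onto code, here is a hedged sketch (object name, application name, and sample data are all illustrative): the main() method below is the driver program, and each action it calls launches one job, which is broken into tasks that run on the executors.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TermsIllustrated {
  // main() is the driver program of this Spark application.
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("terms-illustrated").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("spark", "yarn", "executor", "task"))

    // A transformation only describes work; nothing runs yet.
    val lengths = words.map(_.length)

    // Each action triggers one job, split into tasks that run on executors.
    println(lengths.count())                   // first job
    println(lengths.collect().mkString(", "))  // second job

    sc.stop()
  }
}
```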