Most of the PySpark tutorials out there use Jupyter notebooks to demonstrate Spark’s data processing and machine learning functionality. The reason is simple. When working on a cluster, notebooks make it much easier to test syntax and debug Spark applications by giving you quick feedback and presenting error messages within the UI. Otherwise, you would have to dig through log files to figure out what went wrong — not ideal for learning.

Once you’re confident your code works, you may want to integrate your Spark application into your systems. Here, notebooks are much less useful. To run PySpark on a schedule, we need to move our code from a notebook to a Python script and submit that script to a cluster.

Submitting Spark applications to a cluster from the command line can be intimidating at first. My goal is to demystify the process. This guide will show you how to use the AWS Command Line Interface to:

1. Create a cluster that can handle datasets much larger than what fits on your local machine.
2. Submit a Spark application to the cluster that reads data, processes it, and stores the results in an accessible location.
3. Auto-terminate the cluster once the step is complete, so you only pay for the cluster while you're using it.

Spark Development Workflow

When developing Spark applications for processing data or running machine learning models, my preference is to start by using a Jupyter notebook for the reasons stated above. Here’s a guide to creating an Amazon EMR cluster and connecting to it with a Jupyter notebook.

Once I know my code works, I may want to run the process as a scheduled job. I'll put the code in a script so I can schedule it with Cron or Apache Airflow.
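For example, a scheduled run could be as simple as a crontab entry that calls a wrapper script containing the aws emr create-cluster command covered later in this guide. This is only a sketch; the script path, log path, and schedule below are hypothetical:

# Launch the Spark job every day at 2 AM (wrapper script and log path are placeholders).
0 2 * * * /home/ubuntu/run_spark_job.sh >> /var/log/spark_job.log 2>&1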

Production Spark Applications

Create your AWS account if you haven't already. Install and configure the AWS Command Line Interface. To configure the AWS CLI, you'll need to add your credentials. You can create credentials by following these instructions. You'll also need to specify your default region. For this tutorial, we're using us-west-2. You can use whichever region you want; just be sure to use the same region for all of your resources.
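Configuration itself is a single command. Here's a sketch of the interactive prompts; the values shown are placeholders for your own credentials:

$ aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: us-west-2
Default output format [None]: json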

Defining a Spark Application

For this example, we’ll load Amazon book review data from S3, perform basic processing, and calculate some aggregates. We’ll then write our aggregated data frame back to S3.

The example is simple, but this is a common workflow for Spark.

1. Read the data from a source (S3 in this example).
2. Process the data or execute a model workflow with Spark ML.
3. Write the results somewhere accessible to our systems (another S3 bucket in this example).

If you haven’t already, create an S3 bucket now. Make sure the region you create the bucket in is the same region you use for the rest of this tutorial. I’ll be using the “US West (Oregon)” region. Copy the file below, and be sure to edit the output_path in main() to use your S3 bucket. Then upload pyspark_job.py to your bucket (example upload commands follow the script).

# pyspark_job.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def create_spark_session():
    """Create a Spark session.

    Returns:
        spark (SparkSession) - spark session connected to the AWS EMR cluster
    """
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages",
                "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark


def process_book_data(spark, input_path, output_path):
    """Process the book review data and write to S3.

    Arguments:
        spark (SparkSession) - spark session connected to the AWS EMR cluster
        input_path (str) - AWS S3 bucket path for source data
        output_path (str) - AWS S3 bucket path for writing processed data
    """
    df = spark.read.parquet(input_path)
    # Apply some basic filters and aggregate by product_title.
    book_agg = df.filter(df.verified_purchase == 'Y') \
        .groupBy('product_title') \
        .agg({'star_rating': 'avg', 'review_id': 'count'}) \
        .filter(F.col('count(review_id)') >= 500) \
        .sort(F.desc('avg(star_rating)')) \
        .select(F.col('product_title').alias('book_title'),
                F.col('count(review_id)').alias('review_count'),
                F.col('avg(star_rating)').alias('review_avg_stars'))
    # Save the aggregated data to your S3 bucket as Parquet.
    book_agg.write.mode('overwrite') \
        .save(output_path)


def main():
    spark = create_spark_session()
    input_path = ('s3://amazon-reviews-pds/parquet/' +
                  'product_category=Books/*.parquet')
    output_path = 's3://spark-tutorial-bwl/book-aggregates'
    process_book_data(spark, input_path, output_path)


if __name__ == '__main__':
    main()
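With the script saved locally, you can create the bucket and upload the file from the command line as well. Here's a sketch using the AWS CLI; the bucket name is a placeholder, so substitute your own:

aws s3 mb s3://your-bucket --region us-west-2
aws s3 cp pyspark_job.py s3://your-bucket/pyspark_job.py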

Using AWS Command Line Interface

It’s time to create our cluster and submit our application. Once our application finishes, we’ll tell the cluster to terminate. Auto-terminate allows us to pay for the resources only when we need them.

Depending on our use case, we may not want to terminate our cluster upon completion. For instance, if you have a web application that relies on Spark for a data processing task, you may want to have a dedicated cluster running at all times.

Run the command below. Make sure you replace the placeholder values (such as your-bucket and your-key-pair) with your own. Details on --ec2-attributes and --bootstrap-actions, and all of the other arguments, are included below.

aws emr create-cluster --name "Spark cluster with step" \
    --release-label emr-5.24.1 \
    --applications Name=Spark \
    --log-uri s3://your-bucket/logs/ \
    --ec2-attributes KeyName=your-key-pair \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://your-bucket/emr_bootstrap.sh \
    --steps Type=Spark,Name="Spark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket/pyspark_job.py] \
    --use-default-roles \
    --auto-terminate

Important aws emr create-cluster arguments:

--steps tells your cluster what to do after the cluster starts. Be sure to replace s3://your-bucket/pyspark_job.py in the --steps argument with the S3 path to your own Spark application. If you want to submit a step to a cluster that is already running, see the add-steps sketch after this list.

--bootstrap-actions allows you to specify what packages you want to be installed on all of your cluster's nodes. This step is only necessary if your application uses non-builtin Python packages other than pyspark. To use such packages, create your emr_bootstrap.sh file using the example below as a template, and add it to your S3 bucket. Include --bootstrap-actions Path=s3://your-bucket/emr_bootstrap.sh in the aws emr create-cluster command.

#!/bin/bash
sudo pip install -U \
    matplotlib \
    pandas \
    spark-nlp

--ec2-attributes allows you to specify many different EC2 attributes. Set your key pair using the syntax --ec2-attributes KeyName=your-key-pair. Note: this is just the name of your key pair, not the file path. You can learn more about creating a key pair file here.

--log-uri requires an S3 bucket to store your log files.
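If you keep a cluster running, you can submit the same application to it later instead of creating a new cluster each time. A sketch using aws emr add-steps, where the cluster ID is a placeholder:

aws emr add-steps --cluster-id j-xxxxxxxxxx \
    --steps Type=Spark,Name="Spark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket/pyspark_job.py]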

Other aws emr create-cluster arguments explained:

--name gives the cluster you are creating an identifier.

--release-label specifies which version of EMR to use. I recommend using the latest version.

--applications tells EMR which type of application you will be using on your cluster. To create a Spark cluster, use Name=Spark.

--instance-type specifies which type of EC2 instance you want to use for your cluster.

--instance-count specifies how many instances you want in your cluster.

--use-default-roles tells the cluster to use the default IAM roles for EMR. If this is your first time using EMR, you'll need to run aws emr create-default-roles before you can use this command. If you've created a cluster on EMR in the region you have the AWS CLI configured for, then you should be good to go.

--auto-terminate tells the cluster to terminate once the steps specified in --steps finish. Exclude this argument if you would like to leave your cluster running, but beware that you are paying for your cluster as long as you keep it running.
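If you do leave the cluster running, you can shut it down later from the command line. A sketch, with the cluster ID as a placeholder:

aws emr terminate-clusters --cluster-ids j-xxxxxxxxxx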

Check the Spark application’s progress

After you execute the aws emr create-cluster command, you should get a response:

{
    "ClusterId": "j-xxxxxxxxxx"
}

Sign in to the AWS console and navigate to the EMR dashboard. Your cluster status should be “Starting”. It should take about ten minutes for your cluster to start up, bootstrap, and run your application (if you used my example code). Once the step is complete, you should see the output data in your S3 bucket.
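You can also monitor progress without the console. A sketch using the AWS CLI, where the cluster ID and bucket path are placeholders:

# Check the cluster's current state (e.g., STARTING, BOOTSTRAPPING, RUNNING, TERMINATED).
aws emr describe-cluster --cluster-id j-xxxxxxxxxx --query 'Cluster.Status.State'

# List the output files once the step completes.
aws s3 ls s3://your-bucket/book-aggregates/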

That’s it!

Final thoughts

You now know how to create an Amazon EMR cluster and submit Spark applications to it. This workflow is a crucial component of building production data processing applications with Spark. I hope you’re now feeling more confident working with all of these tools.

Get in touch

Thank you for reading! Please let me know if you liked the article or if you have any critiques. If you found this guide useful, be sure to follow me, so you don’t miss my future articles.

If you need help with a data project or want to say hi, connect with me on LinkedIn. Cheers!