This tutorial is for Spark developers who have no prior knowledge of Amazon Web Services and want to learn a quick and easy way to run a Spark job on Amazon EMR.

This tutorial is the first of a series I want to write on using AWS services (Amazon EMR in particular) to run Hadoop and Spark workloads.

AWS and Amazon EMR

AWS is one of the most widely used cloud services platforms: it offers a large catalog of services, is very well documented, and is easy to use.

A cloud services platform allows users to access on-demand resources (compute power, memory, storage) and services (databases, monitoring, workflows, etc.) over the internet with pay-as-you-go pricing.

Among all the cool services offered by AWS, we will only use two of them:

Simple Storage Service (S3), a massively scalable object storage service

Elastic MapReduce (EMR), a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark. It can be viewed as Hadoop-as-a-Service: you start a cluster with the number of nodes you want, run any job you want, and only pay for the time the cluster is actually up.

You can test many AWS services with a free tier account, but for now EMR is not included, so you will have to pay some fees (less than $0.15/hour for a small cluster). This is nothing compared to all the benefits AWS offers.

Warning: the bills can be pretty expensive if you forget to shut down all your instances!

Now that we have introduced these two services, let’s start using them to run a simple word count!

Tutorial

The aim of this tutorial is to launch the classic word count Spark job on EMR. The input and output files will be stored on S3. All the steps are simple, and I will explain how to perform them using both the AWS web interfaces and the AWS CLI tool.

Here is the word count Spark code (using Spark 2.2.0) that will be used.
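A minimal version in Scala looks like the following; the S3 input and output paths are placeholders that you should replace with your own bucket:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // On EMR the master is provided by the cluster, so no .master() is set here
    val spark = SparkSession.builder
      .appName("WordCount")
      .getOrCreate()

    // Placeholder S3 paths: replace with your own bucket and prefixes
    val inputPath = "s3://my-bucket/input/"
    val outputPath = "s3://my-bucket/output/"

    spark.sparkContext
      .textFile(inputPath)
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(outputPath)

    spark.stop()
  }
}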

Create an AWS account

First of all, create a free tier AWS account on the AWS website. The form is quick and simple to fill in; you will need to enter your credit card information, which will be used for Amazon EMR billing and for any other services you use beyond the free tier quotas.

For those who want to use AWS CLI

If you don’t want to spend time going through the steps of this tutorial in the AWS graphical user interfaces and prefer to automate them, you can install the AWS CLI by following this documentation.
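Once installed, a quick sanity check and the one-time configuration look like this (the values asked for by aws configure are examples; use your own IAM access keys and preferred region):

aws --version
aws configure
# AWS Access Key ID [None]: <your access key>
# AWS Secret Access Key [None]: <your secret key>
# Default region name [None]: eu-west-1
# Default output format [None]: json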

Choose a region on which to deploy the cluster

Depending on the date you signed up for your AWS account, the default region you see when accessing a resource from the AWS Management Console may differ. Each region has its own set of available resources and its own pricing. Since the aim of this tutorial is to show how to run a Spark job, we don’t really care about performance. That is why, to minimize your Amazon bill, I invite you to compare the Amazon EMR pricing and Amazon S3 pricing and choose the cheapest region and configuration.

I didn’t know that when I launched an Amazon EMR cluster for the first time. My bill was three times what I could have paid if I had deployed the cluster in another region.

You can change the region from the selector in the top-right corner of each AWS Management Console page.
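If you go through the AWS CLI instead of the console, you can set a default region once so that every command uses it (eu-west-1 is just an example; pick the region you chose above):

aws configure set region eu-west-1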