1. Run a Spot Instance for ML

First things first. Let’s learn how to create a Spot instance where we can develop and run ML models. We want to use P2 instances, which come with one or more NVIDIA K80 GPUs with plenty of memory (11 GB) to test and train your models on. P2 comes in three sizes: p2.xlarge (1 GPU), p2.8xlarge (8 GPUs) and p2.16xlarge (16 GPUs).

Let’s see how we can actually get one ourselves.

1.1 Tools Needed

Install the AWS CLI, a command-line utility that can be used instead of the web-based AWS Console to manage AWS services.

Then run aws configure to set your key, secret and region. Regions that AWS supports P2 instances in are N. Virginia (us-east-1), Oregon (us-west-2) and Ireland (eu-west-1). Usually you want to choose the region that’s geographically closest to you.
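As a quick sanity check, the region stored by aws configure can be compared against the three regions listed above. The sketch below is only an illustration: the helper name is_p2_region is made up for this example, and the region list is simply the one from this article, which may have changed since.

```shell
#!/bin/sh
# Regions this article lists as supporting P2 instances (may be out of date).
P2_REGIONS="us-east-1 us-west-2 eu-west-1"

# is_p2_region REGION: succeeds if REGION is in the list above.
# (Hypothetical helper, not part of the AWS CLI or the helper scripts.)
is_p2_region() {
  for candidate in $P2_REGIONS; do
    if [ "$candidate" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# Typical use, assuming the AWS CLI is installed and configured:
#   region=$(aws configure get region)
#   is_p2_region "$region" || echo "P2 may not be available in $region" >&2
```

If the check fails, `aws configure set region us-west-2` (or another region from the list) switches the default without re-entering the key and secret.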

Finally, we download the helper scripts (the ec2-spotter directory used in the commands below) that will assist us in the setup.

1.2 Virtual Private Cloud (VPC)

Before we can start any P2 instances, we need to set up a Virtual Private Cloud (VPC), which is just a fancy virtual network to launch your virtual machine in. Setting up a VPC can be a little intimidating. It certainly was for me when I first did it, and the details are still a bit fuzzy. The good news is that it has to be done only once. One way to approach this is to follow Amazon’s guide.

A better approach is to use scripts adapted from Fast.ai’s course Deep Learning For Coders. If you got the helper scripts from Tools Needed above, simply run the following:

. ec2-spotter/fast_ai/create_vpc.sh

This will create a VPC, Internet Gateway, Subnet, Route Table, Security Group and, most importantly, a Key Pair. We will use the newly created key (located at ~/.ssh/aws-key-fast-ai.pem) to connect to the instance we are about to create. The script also prints the IDs of our newly created Subnet and Security Group. We’ll need these for the next step.
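To make the list of resources less fuzzy, here is a rough sketch of the kind of AWS CLI calls involved in building such a network by hand. It is not the contents of create_vpc.sh: the function name create_network, the name prefix argument and the 10.0.0.0/28 address range are assumptions chosen for illustration, and the Key Pair step is left out.

```shell
#!/bin/sh
# Sketch only: names and the CIDR range are illustrative assumptions,
# not values taken from create_vpc.sh.
create_network() {
  name="$1"            # hypothetical name prefix, e.g. "ml-spot"
  cidr="10.0.0.0/28"   # a small private range (16 addresses)

  # The VPC itself, plus a gateway so instances can reach the internet.
  vpc_id=$(aws ec2 create-vpc --cidr-block "$cidr" \
    --query 'Vpc.VpcId' --output text) || return 1
  igw_id=$(aws ec2 create-internet-gateway \
    --query 'InternetGateway.InternetGatewayId' --output text) || return 1
  aws ec2 attach-internet-gateway --internet-gateway-id "$igw_id" \
    --vpc-id "$vpc_id" >/dev/null || return 1

  # One subnet, routed to the internet gateway.
  subnet_id=$(aws ec2 create-subnet --vpc-id "$vpc_id" --cidr-block "$cidr" \
    --query 'Subnet.SubnetId' --output text) || return 1
  rtb_id=$(aws ec2 create-route-table --vpc-id "$vpc_id" \
    --query 'RouteTable.RouteTableId' --output text) || return 1
  aws ec2 associate-route-table --route-table-id "$rtb_id" \
    --subnet-id "$subnet_id" >/dev/null || return 1
  aws ec2 create-route --route-table-id "$rtb_id" \
    --destination-cidr-block 0.0.0.0/0 --gateway-id "$igw_id" >/dev/null || return 1

  # A security group that allows SSH. 0.0.0.0/0 means "from anywhere";
  # narrow this to your own address for anything beyond experiments.
  sg_id=$(aws ec2 create-security-group --group-name "$name-sg" \
    --description "SSH access for $name" --vpc-id "$vpc_id" \
    --query 'GroupId' --output text) || return 1
  aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
    --protocol tcp --port 22 --cidr 0.0.0.0/0 >/dev/null || return 1

  # Print the two IDs needed to launch an instance.
  echo "subnetId=$subnet_id securityGroupId=$sg_id"
}
```

Calling `create_network ml-spot` with a configured AWS CLI would create real, billable-adjacent resources in your account, so treat it as reading material rather than something to paste blindly.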

1.3 Create the Instance

We can follow Amazon’s instructions for launching a Spot instance. But because we are cooler than that, we can use a little helper script named start_spot_no_swap.sh to launch the instance.

We need to pass it the following arguments:

ami: Depending on which region we have picked and whether we want to use the Fast.ai image or the Amazon one, we need to select an image. (Amazon images below are updated to version 1.3 from April 2017).

subnetId: Use the subnet ID that create_vpc.sh printed.

securityGroupId: Use the security group ID that create_vpc.sh printed.

For example:

. ec2-spotter/fast_ai/start_spot_no_swap.sh --ami ami-53b23433 --subnetId subnet-9f69c3d6 --securityGroupId sg-a62f2ede

The script will then print the IP of our new Spot instance.
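Since the three IDs are easy to mix up when copying them around, a small format check before launching can save a failed request. This is a sketch under the assumption that IDs consist of a resource prefix followed by 8 or 17 lowercase hexadecimal characters; valid_id is a hypothetical helper, not something the helper scripts provide.

```shell
#!/bin/sh
# valid_id PREFIX ID: succeeds if ID looks like "<PREFIX>-<8 or 17 hex chars>".
# (Hypothetical helper; it checks the format only, not that the resource exists.)
valid_id() {
  printf '%s\n' "$2" | grep -Eq "^$1-([0-9a-f]{8}|[0-9a-f]{17})$"
}

# Example with the values used in this article.
if valid_id ami ami-53b23433 &&
   valid_id subnet subnet-9f69c3d6 &&
   valid_id sg sg-a62f2ede; then
  echo "IDs look well-formed"
else
  echo "At least one ID is malformed" >&2
fi
```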

Optionally, we can also pass the following: volume_size (size of the root volume in GB; default 128), key_name (name of the key file we’ll use to log into the instance; default aws-key-fast-ai), ec2spotter_instance_type (type of instance to launch; default p2.xlarge) and bid_price (the maximum price we are willing to pay, in USD per hour; default 0.5).
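To see how these parameters relate to an actual Spot request, here is a sketch that turns them into a launch specification for the AWS CLI. It is an assumption about what such a request can look like, not a copy of what start_spot_no_swap.sh does internally; the function name launch_spec and the root device name /dev/sda1 are illustrative choices.

```shell
#!/bin/sh
# launch_spec AMI SUBNET_ID SECURITY_GROUP_ID
# Prints a JSON launch specification, falling back to the defaults
# listed in this article when the optional variables are unset.
# (Hypothetical helper; /dev/sda1 as the root device is an assumption.)
launch_spec() {
  cat <<EOF
{
  "ImageId": "$1",
  "InstanceType": "${ec2spotter_instance_type:-p2.xlarge}",
  "KeyName": "${key_name:-aws-key-fast-ai}",
  "BlockDeviceMappings": [
    {"DeviceName": "/dev/sda1",
     "Ebs": {"VolumeSize": ${volume_size:-128}, "DeleteOnTermination": true}}
  ],
  "NetworkInterfaces": [
    {"DeviceIndex": 0, "SubnetId": "$2", "Groups": ["$3"],
     "AssociatePublicIpAddress": true}
  ]
}
EOF
}

# Typical use, assuming a configured AWS CLI:
#   aws ec2 request-spot-instances \
#     --spot-price "${bid_price:-0.5}" --instance-count 1 --type one-time \
#     --launch-specification "$(launch_spec ami-53b23433 subnet-9f69c3d6 sg-a62f2ede)"
```

Keeping the specification in a function makes it easy to print and inspect the JSON before any request is sent.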

1.4 Login and test

Using the IP of the Spot instance from the previous step, we can connect via ssh:

instance_ip=instance_ip_from_previous_step

ssh -i ~/.ssh/aws-key-fast-ai.pem ubuntu@$instance_ip
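A freshly launched instance may refuse connections for a short while as it boots, so the first ssh attempt can fail. One way to handle this is a small retry loop, sketched below; retry is a hypothetical helper written for this example, and the attempt count and delay in the usage comment are arbitrary.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed,
# sleeping DELAY_SECONDS between tries. (Hypothetical helper.)
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Typical use: wait up to roughly 10 x 15 seconds for SSH to come up.
# The first successful connection will still ask to confirm the host key.
#   retry 10 15 ssh -i ~/.ssh/aws-key-fast-ai.pem -o ConnectTimeout=5 \
#     ubuntu@"$instance_ip" true
```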

Now we can develop and test ML models to our heart’s delight. For example, let’s test it with TensorFlow’s tutorial on MNIST:

python src/tensorflow/tensorflow/models/image/mnist/convolutional.py

yields:

...

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:

name: Tesla K80

major: 3 minor: 7 memoryClockRate (GHz) 0.8235

pciBusID 0000:00:1e.0

Total memory: 11.17GiB

Free memory: 11.11GiB

...

Step 0 (epoch 0.00), 771.1 ms

Minibatch loss: 8.334, learning rate: 0.010000

Minibatch error: 85.9%

Validation error: 84.6%

Step 100 (epoch 0.12), 12.2 ms

Minibatch loss: 3.262, learning rate: 0.010000

Minibatch error: 6.2%

Validation error: 7.3%
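When a run prints many steps, it can help to reduce the log to one line per validation check. The sketch below assumes the log format shown above ("Step N ..." lines followed by "Validation error: X%" lines); summarize_log is a hypothetical helper for this example.

```shell
#!/bin/sh
# summarize_log: reads a training log on stdin and prints
# "<step> <validation error in percent>" for each validation line.
# (Hypothetical helper; it relies on the log format shown in this article.)
summarize_log() {
  awk '
    /^Step [0-9]+/       { step = $2 }
    /^Validation error:/ { sub(/%$/, "", $3); print step, $3 }
  '
}

# Typical use on the instance:
#   python src/tensorflow/tensorflow/models/image/mnist/convolutional.py | summarize_log
```

For the excerpt above this prints `0 84.6` and `100 7.3`, which makes the drop in validation error easy to follow.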

All seems good!

We have a Virtual Private Cloud, a Spot instance running in it for a fraction of the On-Demand price, and even a model training. But all is not roses!

What if, with hard work and wit, we manage to get the above MNIST script to beat state-of-the-art accuracy, and then we shut down our instance for the night? Our great model would be lost. We need a way to persist the data on our Spot instances. Luckily, we found two.