Deep Learning on Amazon EC2 Spot Instances Without the Agonizing Pain


In my previous article I briefly described SageMaker, a generic machine learning service from Amazon launched in November 2017. Although SageMaker is a great service, it may not suit everyone. It did not work for us because of the way it handles data and, most importantly, because of its high pricing. So we turned to AWS EC2 and its Spot Instances.

First I am going to explain what Spot Instances are, how to use them for deep learning, and what it takes. Then I will introduce a tool that automates the routine tasks associated with Spot Instances and, hopefully, makes the overall experience more pleasant. I am the author of the tool, so any feedback is very welcome.

I assume that you are familiar with AWS in general and the EC2 service in particular. If not, please take a look at the Resources section at the end of this article.

Spot Instances

The Amazon EC2 service allows you to rent virtual servers (instances) with the specs you want and for the amount of time you need. You can request an On-Demand instance with a couple of clicks and normally do not have to wait for it, because EC2 maintains a pool of spare computational resources to allocate from. To monetize this spare capacity while still being able to provide it on demand, Amazon offers Spot Instances.

Spot Instances are allocated from EC2 spare capacity, without the lifetime guarantees of On-Demand instances, but at a generous discount. Often you can get the same computational resources for a third of the normal price. The catch is that these resources may be taken from you at any moment. Spot prices are set based on current supply and demand, which means that the more spare capacity is available, the less you pay. The price can go up and down following demand, and it used to be rather chaotic at times. There were complaints about unpredictable price jumps, so in late 2017 Amazon changed its pricing policy. Since then, Spot prices have been adjusted gradually and predictably.

When you request a Spot Instance, you actually place a bid for some spare capacity. You specify the maximum amount of money you are willing to pay per hour for an instance and, if the current Spot price is lower than your maximum price, you get the instance. If at some point the price rises above your maximum, the instance is shut down and taken from you. At least you will not be surprised by the next bill.

When your main priority is to get the job done, you can set the price limit equal to the On-Demand price. In this case your instances should never be taken from you, and you still save money as long as the Spot price is lower. At least that is how it should work in theory, because there are absolutely no guarantees. It happened to us several times that our Spot Instances were revoked even though we were willing to pay the On-Demand price. This might be specific to GPU instances, since the pool of them is probably relatively small and demand can be spiky at times.

In short, Spot Instances allow you to save money or do more with the same budget, but might be revoked at any time. Let’s see how to still do some Science in such uncertain conditions.

Deep Learning on EC2 Spot Instances

To use Spot Instances without sedatives, you should embrace the volatility and accept that they might go away at any moment. Moreover, you yourself will probably be making and canceling requests all the time, since you do not want to pay for instances spinning idle. When Spot Instances are stopped for any reason, they get terminated and deleted, so you need to automate the preparation of your working environment and be careful not to lose any valuable data.

The key to preserving your mental health when working with Spot Instances is persistent data storage. The most efficient option is Amazon EBS volumes. Create a separate volume to store all your data and code, and attach it to each newly created Spot Instance. A volume can be attached to only one instance at a time. You can use the same or another attached volume to save checkpoints during training. This will allow you to restore the previous state of your training environment in minutes. Just be careful not to save a new checkpoint right over the previous one: saving takes some time, and your instance might be terminated in the middle of the process, which would almost certainly corrupt the checkpoint. At the very least, alternate between two save locations in your training script.
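As a sketch, such alternation can be implemented with a small helper in the training script. The class and paths here are illustrative, not part of any framework:

```python
class AlternatingCheckpointer:
    """Alternate checkpoint writes between two paths so that an
    interruption in the middle of a save corrupts at most one of them."""

    def __init__(self, path_a, path_b):
        self.paths = [path_a, path_b]
        self.turn = 0  # index of the path to write next

    def save(self, save_fn):
        """Call save_fn with the next target path and return that path.

        save_fn is your framework's save routine, e.g. a wrapper around
        torch.save or model.save, taking the target path as its argument.
        """
        path = self.paths[self.turn]
        save_fn(path)
        self.turn = 1 - self.turn  # alternate for the next save
        return path
```

On restore, load whichever of the two files is newer and intact.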

When you are ready to run some computations, create a request for one or more Spot Instances. For instructions on how to make the request using the online console or the AWS CLI, please check the official documentation here. Normally it takes a little over a minute for the request to be fulfilled. In rare cases, when the requested resources are not available, you will have to wait for them. Since the waiting time can be pretty unpredictable, you might want to request another instance type instead. Keep an eye on the Spot Instance pricing history and the Spot Instance Advisor to identify the best deal.
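The same request can also be made programmatically with the boto3 library. This is a rough sketch: the AMI ID, key pair name, Availability Zone, and maximum price below are placeholders you would replace with your own values.

```python
# The EC2 client would come from boto3, e.g.:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch specification for a single GPU Spot Instance.
# ImageId and KeyName are placeholders.
LAUNCH_SPEC = {
    "ImageId": "ami-0123456789abcdef0",   # e.g. a Deep Learning AMI
    "InstanceType": "p2.xlarge",
    "KeyName": "my-key-pair",
    "Placement": {"AvailabilityZone": "us-east-1a"},
}

def request_spot_instance(ec2, max_price="0.90"):
    """Bid up to max_price USD per hour for one Spot Instance."""
    return ec2.request_spot_instances(
        SpotPrice=max_price,
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification=LAUNCH_SPEC,
    )
```
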

By looking at the Spot Instance pricing history, you might notice that the price of the same instance type can differ significantly between AWS regions and even between Availability Zones within the same region. Yet an instance must be in the same region and Availability Zone as any EBS volume you want to attach to it. If you need to copy or move your data between Availability Zones, use EBS snapshots, which are not bound to an Availability Zone.
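Sketched with boto3 (assuming an `ec2` client with permissions to create snapshots and volumes; error handling omitted), such a move might look like:

```python
def copy_volume_to_az(ec2, volume_id, dest_az):
    """Copy an EBS volume into another Availability Zone via a snapshot.

    Snapshots are regional, so a snapshot taken from a volume in one AZ
    can seed a new volume in any other AZ of the same region.
    """
    snap = ec2.create_snapshot(VolumeId=volume_id,
                               Description="moving volume across AZs")
    # Wait until the snapshot is fully captured before using it.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
    vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                            AvailabilityZone=dest_az)
    return vol["VolumeId"]
```
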

For deep learning tasks you are most likely interested in GPU instances. There are several families of instances with GPUs: P2 and P3 are general purpose, while G3 is optimized for graphics-intensive applications. P2 instances are equipped with NVIDIA K80 GPUs (2,496 CUDA cores, 12 GiB of GPU memory), while P3 instances feature NVIDIA Tesla V100 GPUs (5,120 CUDA cores, 640 Tensor Cores, 16 GiB of GPU memory). Note that some types might not be available in some regions.

EC2 imposes limits on how many instances of each type you are allowed to launch. Limits are set on a per-region basis, and the default limits for GPU instances are one or even zero. Make sure to request an instance limit increase in advance, because these requests go through AWS Support and thus might take a day or two to be fulfilled.

Now, what about the software? When requesting a Spot Instance, you can specify which Amazon Machine Image (AMI) should be used as a template for it. An AMI captures the exact state of a software environment: the operating system, libraries, applications, etc. A multitude of pre-configured AMIs can be found in the AWS Marketplace. There is a group of AMIs called Deep Learning AMIs, created by Amazon specifically for deep learning applications. They come pre-installed with open-source deep learning frameworks, including TensorFlow, Apache MXNet, PyTorch, Chainer, Microsoft Cognitive Toolkit, Caffe, Caffe2, Theano, and Keras, optimized for high performance on Amazon EC2 instances. These AMIs are free to use; you only pay for the AWS resources needed to store and run your applications. There are several different flavors of Deep Learning AMIs; check the guide to learn the differences between them.

Workflow

The overall training workflow might look like this:

1. Create an EBS volume to store data, code, checkpoints, etc.
2. Request a general purpose Spot Instance (e.g. M4 or M5), attach the EBS volume to it, and move your data and code there. Use this relatively cheap instance for the initial preparation of the data. Shut down the instance.
3. Request a GPU Spot Instance (e.g. P2 or P3) with a Deep Learning AMI, attach the EBS volume to it, install whatever is missing, and start training. Periodically save checkpoints.
4. If (when) the instance gets terminated, repeat the request. If needed, change the instance type.

Obviously, a lot of things in the above workflow can be automated. The next section is dedicated to an automation tool that should make the process much less tedious.

Automation

Meet Portal Gun — a command line tool written in Python that automates repetitive tasks associated with the management of Spot Instances.

Rick and Morty by Adult Swim.

Obviously, the name “Portal Gun” was picked after thorough consideration and reflects the tool's purpose: to provide instant access to remote spots… of the Universe… or… whatever. Anyway, following the metaphor, you make a Spot Instance request by opening a portal and cancel the request by closing the portal.

Quick Start Guide

Full documentation can be found at http://portal-gun.readthedocs.io.

I. Installation and configuration

It is strongly recommended to install Portal Gun in a virtual Python environment.

To install the latest stable version from PyPI, run:

$ pip install portal-gun

Portal Gun reads its basic configuration from a file in JSON format. Create a new file /path-to-virtual-python-environment/bin/config.json and specify the AWS region and credentials for programmatic access to your AWS resources. Check the exact format of the config file here.

Portal Gun requires several permissions to access AWS resources. The most convenient way to grant the required permissions is via an AWS IAM policy. A reference policy including all required permissions can be found here.

II. Basic usage

Assuming that the virtual environment is activated, you call Portal Gun with:

$ portal <Command>

Print the top-level help message to see the list of available commands:

$ portal -h

You can add the -h (or --help) flag after commands and command groups to print the corresponding help messages.

III. Portals

Portal Gun was designed around the concept of portals, hence the name. A portal represents a remote environment and encapsulates such things as a virtual server (Spot Instance) of some type, persistent storage, an operating system of choice, libraries and frameworks, etc.

To open a portal means to request a Spot Instance; to close a portal means to cancel the request and terminate the instance. For example, if you are training a model, you open a portal for a training session and close it when the training is finished. If you follow the recommended workflow (see above), you should be able to open the portal again and find everything exactly as you left it.

A portal is defined by a portal specification file which describes a particular environment in JSON format. You can create a draft specification file using:

$ portal init <Portal-Name>

A file with the name <Portal-Name>.json will be created. Open the file and fill in the appropriate values. Please refer to the documentation for details.

Open a portal with:

$ portal open <Portal-Name>

Notice that Portal Gun expects <Portal-Name>.json to be in the current folder. Usually it takes a little over a minute for the open command to finish. Once the portal is open, you can ssh into the remote machine with:

$ portal ssh <Portal-Name>

For long-running tasks like training a model, it is particularly useful to be able to close the current ssh session without interrupting the running task. For this you can use tmux. Simply add the -t flag after the ssh command to automatically run tmux within the ssh session. You can then run the long task within the tmux session and safely detach from it, keeping the task running.

Check information about a portal with:

$ portal info <Portal-Name>

The information includes the portal status (open or closed) and, if the portal is open, details about the instance and attached volumes.

Close a portal with:

$ portal close <Portal-Name>

IV. Persistent volumes

Portal Gun allows you to manage EBS volumes from the command line. It also automatically attaches and mounts volumes to instances according to the portal specification.

Create a new EBS volume with:

$ portal volume create

Every volume requires a size (in GiB) and an availability zone to be specified. A name is optional, but recommended. If these three properties are not set using command options, they will be requested from standard input.

Upon successful creation of a new volume, its <Volume-Id> will be provided.

You can list created volumes with:

$ portal volume list

In this way you can easily see the existing volumes and check which of them are available and which are in use by an instance. By default, you only see volumes created by Portal Gun on behalf of your AWS user. To see all volumes, add the -a flag after the list command.

To update a volume, use the volume update command, specifying its <Volume-Id>. You can set a new name for a volume and increase its size. For instance, to set the name “new_name” and a size of 100 GiB for the volume vol-0123456789abcdef, do:

$ portal volume update vol-0123456789abcdef -n new_name -s 100

You can delete a volume with:

$ portal volume delete <Volume-Id>

V. Channels

To synchronize files between a Spot Instance and your local machine, define channels in the portal specification. Synchronization is done continuously using rsync and has to be started explicitly with a command. Every channel is either inbound (files are copied from remote to local) or outbound (files are copied from local to remote).

For instance, you may edit scripts locally and configure a channel to send them to the remote instance after every save. You might configure another channel to automatically fetch intermediate results from the remote instance to your local machine for preview. The channel specification for such a use case might look as follows:

"channels": [

{

"direction": "out",

"local_path": "/local/source/code/folder/",

"remote_path": "/remote/source/code/folder/",

"recursive": true

},

{

"direction": "in",

"local_path": "/local/results/folder/",

"remote_path": "/remote/results/folder/",

"recursive": false,

"delay": 10.0

}

]

Start syncing specified folders with:

$ portal channel <Portal-Name>

Remarks

As its author, I would appreciate any feedback you might have regarding Portal Gun. And you are more than welcome to contribute!

As of the time of writing, Portal Gun only supports AWS; however, there are plans for GCP (Preemptible VM Instances) and Azure (Low-priority VMs). Maybe even the recently announced Vectordash service, if it takes off and provides the required API.

Another desired feature is support for experiments, which would be logged and run to completion, no matter what.

Resources
