Spot Instances — zero to hero in a couple of paragraphs

You should definitely read this part if you’re new to Spot Instances, if you’re not caught up with the major changes in the pricing model (no bidding, no price spikes) introduced in November 2017, or if you’re not familiar with how EC2 Auto Scaling groups have allowed you to mix purchase options and instance types since November 2018.

Spot Instances are spare EC2 compute capacity available at discounts of up to 90% compared to the On-Demand price. Unlike the On-Demand and Reserved Instance purchase options, which have static prices (unless AWS lowers them), Spot prices fluctuate, but only gradually and at most a few times a day, according to long-term supply and demand in each capacity pool (a combination of an instance type and an Availability Zone).

Spot pricing history for the last 3 months shows r4.xlarge in US East (N. Virginia) with very small price differences between the Availability Zones (less than ~$0.002), and during those 3 months the Spot price actually changed in only some of the Availability Zones.

The one crucial best practice you should take away from this section is flexibility and diversification. You may be running your stateless, fault-tolerant, distributed workload (examples: a cluster of web/application servers behind an ELB, batch processing that consumes jobs from a queue, or container instances/worker nodes) on On-Demand (with RIs, Savings Plans, or neither) on a specific instance type, because you qualified that instance type or simply started using it successfully. When you start adopting Spot in your applications, you need to diversify your usage across as many capacity pools (instance types in AZs) as possible.

The reason is simple. By using multiple capacity pools you (a) increase the chances of getting Spot capacity, instead of requesting capacity from a single instance type in a specific AZ and failing if that pool has no spare capacity; and (b) limit the blast radius of interruptions: if EC2 needs the capacity back for On-Demand usage, typically not all capacity pools will be interrupted at the same time, so only a small portion of your workload will be interrupted, and then replenished by Auto Scaling groups (or Spot Fleet, which we’ll get to later), avoiding any impact on your availability or the need to fall back to On-Demand and pay more.
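The arithmetic behind both points can be sketched in a few lines. The probabilities below are purely illustrative assumptions (real capacity availability is neither uniform nor independent across pools), but the direction of the math holds:

```python
# Illustrative numbers only: assumed, not real AWS figures.
p_no_capacity = 0.05  # assumed chance a single pool has no spare Spot capacity

# (a) Chance of failing to get ANY Spot capacity:
single_pool_failure = p_no_capacity        # 1 pool:  5%
ten_pool_failure = p_no_capacity ** 10     # 10 independent pools: ~1e-13

# (b) Blast radius when one pool is reclaimed, capacity spread evenly:
fraction_lost_1_pool = 1 / 1    # one pool interrupted -> 100% of nodes gone
fraction_lost_21_pools = 1 / 21 # one of 21 pools interrupted -> ~4.8% of nodes

print(f"P(no capacity, 1 pool):   {single_pool_failure:.2%}")
print(f"P(no capacity, 10 pools): {ten_pool_failure:.2e}")
print(f"Capacity lost per interruption, 21 pools: {fraction_lost_21_pools:.1%}")
```

The absolute numbers don’t matter; what matters is that the chance of getting no capacity at all shrinks exponentially with the number of pools, while the share of your fleet lost to a single interruption shrinks linearly.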

For example: if I’m running my application on c5.large today, I can probably also run it on m5.large — in most cases the application will work just fine if the operating system simply sees more memory and a slightly slower CPU clock speed (unless we’re talking about CPU-sensitive workloads or ones that use specific instruction sets like AVX-512). Similarly, you can use r5.large with even more memory, and go back a generation to also use c4.large / m4.large / r4.large. This concept works just fine with the Kubernetes scheduler, but requires some adaptations once we start talking about autoscaling in Kubernetes — a topic we will dive deep into later in this post.

Also take a look at the allocation strategies supported by EC2 Auto Scaling groups. To follow the diversification best practice and spread your nodes/pods across as many capacity pools as possible, customers have typically used the lowest-price allocation strategy with the number of Spot pools set to the number of instance types they selected (or one less). However, the recommended approach is the capacity-optimized allocation strategy, launched in August 2019. This strategy launches Spot Instances from the deepest capacity pools, so they are the least likely to be interrupted. Read more about the launch of the capacity-optimized allocation strategy here.
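To make this concrete, here is a sketch of the MixedInstancesPolicy shape you would pass to the EC2 Auto Scaling CreateAutoScalingGroup API (for example via boto3’s create_auto_scaling_group). The launch template name and instance type list are placeholders for illustration:

```python
# Sketch only: "my-node-group-template" and the instance types are placeholders.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "my-node-group-template",  # placeholder
            "Version": "$Latest",
        },
        # One override per instance type = one more capacity pool per AZ.
        "Overrides": [
            {"InstanceType": t}
            for t in ["c5.large", "m5.large", "r5.large",
                      "c4.large", "m4.large", "r4.large"]
        ],
    },
    "InstancesDistribution": {
        "OnDemandBaseCapacity": 0,
        "OnDemandPercentageAboveBaseCapacity": 0,  # everything beyond base is Spot
        "SpotAllocationStrategy": "capacity-optimized",
    },
}

print(mixed_instances_policy["InstancesDistribution"]["SpotAllocationStrategy"])
```

Note that with capacity-optimized you no longer set SpotInstancePools — that parameter only applies to the lowest-price strategy.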

So how does AWS make it easy to follow these instance type flexibility best practices? Enter EC2 Auto Scaling groups.

EC2 Auto Scaling groups

If you’re confused about ASGs vs. Spot Fleet or EC2 Fleet for your Kubernetes clusters, don’t be. These tools and APIs have similar traits, but I’ll make it very simple for you. Today, Fleets are more suitable for large-scale jobs that have a beginning and an end — for example, I need 3,000 vCPUs to process my videos or images on S3, to run a nightly Hadoop/Spark job, or any other type of batch computing job. While Spot Fleet does have Auto Scaling capabilities very similar to those of ASGs, ASGs are better for workloads that run continuously as part of a service and do not need to reach a finish line. Some of the benefits of ASGs for these types of workloads include: lifecycle hooks (which allow you to easily drain your container instances on a scale-in activity; we’ll touch on that later in the post), protecting an instance from scale-in, attaching or detaching instances to/from an ASG, terminating a specific instance in an ASG, ELB health check integration, and balancing the number of instances across Availability Zones. Also, and most importantly for our topic at hand: community-driven tools are integrated with EC2 ASGs — for example eksctl, the Kubernetes cluster-autoscaler, kops and others.

Here’s an example of creating an EC2 Auto Scaling group from the AWS Management Console. Note that you don’t actually have to do this when using eksctl or kops, because these tools set up the ASGs for you.

Creating a new EC2 Auto Scaling group in the AWS console. I selected 7 instance types that have similar vCPU and memory specifications, and will run this cluster in 3 Availability Zones, for a total of 21 capacity pools, thus increasing the chance that I’ll be able to get my desired Spot capacity, and also the chance of keeping that capacity if some pools are interrupted when EC2 needs the capacity back. The Spot allocation strategy is capacity-optimized.
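The capacity-pool math is simply the cross product of instance types and AZs. A quick sketch, with a placeholder selection of 7 similarly-sized types:

```python
from itertools import product

# Placeholder selection: any 7 instance types with similar vCPU/memory specs.
instance_types = ["c5.large", "c4.large", "m5.large", "m4.large",
                  "r5.large", "r4.large", "t3.large"]
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]

# Each (instance type, AZ) pair is a separate Spot capacity pool.
capacity_pools = list(product(instance_types, azs))
print(len(capacity_pools))  # 7 types x 3 AZs = 21 pools
```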

And finally: how can I know which instances are best for my Spot usage? Use the Spot Instance Advisor tool to check the historical interruption rate (in the last 30 days) of each of the instance types in your region of choice. Our Kubernetes cluster is going to be set up to be fully fault-tolerant to Spot interruptions by catching the Spot 2-minute interruption warning and draining the worker nodes that are going to be terminated, but it’s still a good idea to focus on instance types with lower interruption rates.
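If you want to fold the advisor’s guidance into tooling, the idea is just a filter over interruption-rate bands. The data below is made up for illustration — the real rates come from the Spot Instance Advisor for your region:

```python
# Toy data mirroring what you would read off the Spot Instance Advisor.
# These bands are NOT real figures; check the advisor for your region.
advisor_sample = {
    "c5.large": "<5%",
    "m5.large": "<5%",
    "r5.large": "5-10%",
    "c4.large": "10-15%",
}

LOW_RATE_BANDS = {"<5%", "5-10%"}  # assumed threshold for "low interruption"

preferred = sorted(t for t, band in advisor_sample.items()
                   if band in LOW_RATE_BANDS)
print(preferred)
```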

Two considerations for Auto Scaling groups around Multi-AZ:

If you use Persistent Volumes backed by EBS, then you’re going to need to run the node group / Auto Scaling group for that application in a single Availability Zone, because EBS volumes are zonal and you don’t want your pod scheduled in an AZ where the EBS volume does not exist. If your use case allows for Amazon Elastic File System (EFS), which spans multiple AZs, then you can ignore this limitation and run the pods that work against the EFS mount in multiple AZs, and the same goes for Amazon FSx for Lustre.
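One mitigation worth knowing for the initial scheduling case is a topology-aware StorageClass: with volumeBindingMode set to WaitForFirstConsumer, the EBS volume is only created after the pod is scheduled, so it lands in the pod’s AZ. A sketch (the StorageClass name is a placeholder; this uses the in-tree EBS provisioner):

```yaml
# Delays volume creation until a pod using the claim is scheduled,
# so the volume is provisioned in that pod's AZ.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-topology-aware   # placeholder name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
```

This helps pods start in the right AZ, but once the volume exists it is still zonal — so a pod rescheduled later must still land in that AZ, which is why the single-AZ node group advice above stands.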

Auto Scaling groups strive to keep the same number of instances in each of the AZs they run in. This might cause worker nodes to be terminated when the ASG tries to scale down the number of instances in an AZ. If you run a tool that automatically drains instances upon ASG scale-in activities (which we will touch on later in this post), then you shouldn’t worry about this. Otherwise, you can simply disable this behavior by suspending the AZRebalance process, but you then risk your capacity becoming unbalanced across AZs. Note that cluster-autoscaler itself picks the instance to terminate, so this scale-in concern does not apply to cluster-autoscaler’s operations, only to the AZRebalance process.

Two last words about mixed ASGs: Launch Templates. These are a requirement if you want to run an ASG with multiple purchase options and instance types. Conceptually, they are the same as Launch Configurations in that they allow you to configure things like the AMI, storage, networking, user data and other settings, and use the template to launch an instance, an ASG, or a Spot Fleet, or use it in AWS Batch and possibly more services in the future. Launch Templates are also more advanced in that they support versioning, but we won’t dive deep into LTs — read the docs if you want to learn more.

If you want to get hands-on experience with running stateless applications on EC2 Auto Scaling groups (not necessarily with Kubernetes), have a look at https://www.ec2spotworkshops.com

I highly recommend the Spot / Kubernetes deep dive workshop, which will help you implement the best practices described in this article and pick up more learning points along the way: https://ec2spotworkshops.com/using_ec2_spot_instances_with_eks.html