Adding EC2 Instance Recovery Alarms with CloudFormation

Instance Recovery is a little-advertised, little-used feature of EC2. It doesn’t take long to set up and promises to recover your instance on the rare occasion that the underlying hardware fails. Recovery resumes the instance on new hardware, retaining its instance ID, private IP addresses, Elastic IP addresses, and all instance metadata.

I’ve deployed it on “snowflake” instances that don’t have the luxury of using Auto Scaling Groups. This gives me a little extra uptime assurance. I don’t think I’ve actually ever seen any EC2 instance get auto-recovered though.

Maybe I’m cargo-culting it, but it’s not much work to set up, so it feels like an easy (potential) win.

Update (2019-09-13): I asked on Reddit for examples and Redditron-2000-4 replied. They have 1800 EC2 instances and see 1-3 automatic recoveries a month. That is a recovery rate of roughly 0.06%-0.17% of instances per month. Small but significant if you’re looking for even 99.9% uptime!

You can set up a recovery alarm for an instance by clicking through the console, as per the documentation.

I like automating with CloudFormation though. I converted the resulting manually-created alarm into a template snippet some time ago, and I copy-paste it between projects. Let’s take a look at it.

If you have an EC2 instance in your CloudFormation template’s Resources like so:

    Ec2Instance:
      Type: AWS::EC2::Instance
      Properties:
        LaunchTemplate:
          LaunchTemplateId: !Ref LaunchTemplateId
          Version: !Ref LaunchTemplateVersion
        Tags:
          - Key: Name
            Value: Snowflake-Production

…then a basic recovery alarm for it would look like this:

    Ec2InstanceAutorecoverAlarm:
      # Recover the instance if its EC2 status checks fail, as per:
      # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
      # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html#AddingRecoverActions
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmActions:
          - !Sub arn:aws:automate:${AWS::Region}:ec2:recover
        AlarmDescription: Recover instance if its status checks fail.
        Namespace: AWS/EC2
        MetricName: StatusCheckFailed_System
        Dimensions:
          - Name: InstanceId
            Value: !Ref Ec2Instance
        EvaluationPeriods: 2
        Period: 60
        Statistic: Minimum
        ComparisonOperator: GreaterThanThreshold
        Threshold: 0

The snippet sets EvaluationPeriods to 2 and Period to 60 seconds, which gives you 2 minutes of a failing instance before it’s recovered. This is as recommended in the manual setup documentation.
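Since recoveries are rare, it can be handy to hear about them when they do happen. AlarmActions takes a list, so one option is to add an SNS topic alongside the recover action. This is only a sketch and not part of my original snippet: the RecoveryNotificationTopic resource and its name are hypothetical, and I’m assuming the topic’s default access policy is enough for alarms in the same account to publish to it.

```yaml
# Hypothetical topic; subscribe an email address or chat hook to it separately.
RecoveryNotificationTopic:
  Type: AWS::SNS::Topic

Ec2InstanceAutorecoverAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmActions:
      # Same recover action as in the basic alarm.
      - !Sub arn:aws:automate:${AWS::Region}:ec2:recover
      # Assumption: !Ref on an AWS::SNS::Topic resolves to the topic ARN,
      # which is what an alarm action expects.
      - !Ref RecoveryNotificationTopic
    AlarmDescription: Recover instance if its status checks fail, and notify.
    Namespace: AWS/EC2
    MetricName: StatusCheckFailed_System
    Dimensions:
      - Name: InstanceId
        Value: !Ref Ec2Instance
    # Window before the alarm fires: EvaluationPeriods x Period = 2 x 60 seconds.
    EvaluationPeriods: 2
    Period: 60
    Statistic: Minimum
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
```

Both resources would sit under Resources next to Ec2Instance.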

The manual setup documentation also shows how to create alarms that stop, terminate, or reboot instances. I’ve not needed any of those, but if you do, you should be able to adapt this template snippet to match.
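As a rough idea of what such an adaptation could look like, here’s an untested sketch of a reboot alarm. The resource name Ec2InstanceRebootAlarm is made up, and the choices of the StatusCheckFailed_Instance metric and 3 evaluation periods are my assumptions from reading the AWS documentation, so check them against the docs before relying on this.

```yaml
Ec2InstanceRebootAlarm:
  # Hypothetical adaptation of the recovery alarm: reboot the instance if its
  # instance-level status checks fail.
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmActions:
      # Swap the final segment for "stop" or "terminate" for the other actions.
      - !Sub arn:aws:automate:${AWS::Region}:ec2:reboot
    AlarmDescription: Reboot instance if its instance status checks fail.
    Namespace: AWS/EC2
    # Assumption: reboots target instance-level failures, while recovery
    # targets system-level (underlying hardware) failures.
    MetricName: StatusCheckFailed_Instance
    Dimensions:
      - Name: InstanceId
        Value: !Ref Ec2Instance
    # Assumption: 3 x 60 seconds, one period longer than the recovery alarm,
    # so the two actions are less likely to fire at the same time.
    EvaluationPeriods: 3
    Period: 60
    Statistic: Minimum
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
```

The rest of the properties are carried over unchanged from the recovery alarm.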

Fin

Hope this helps you recover,

—Adam

