Overview

We spent $65,589.59 on AWS in September. This is up 1% from August and is 4.3% of September MRR. The increase comes from adding two Cassandra clusters and two ELK stacks. We saw a decrease in S3 spend due to a few changes we made in how we serve files in emails.

High-level breakdown:

EC2-Instances - $21,058.39 ( +8% )

Relational Database Service - $18,780.04 ( -1% )

S3 - $9,951.83 ( -10% )

EC2-Other - $5,395.91 ( +7% )

Support - $4,393.24 ( 0% )

Others - $6,010.16 ( +2% )

EC2-Instances - $21,058.39 ( +8% )

We deployed 4 new clusters into our infrastructure in September:

staging ELK stack ($240)

secondary staging Cassandra cluster ($240)

production ELK stack ($950)

secondary production Cassandra cluster ($3000)

These clusters aren't cheap, so the fact that our total bill only increased by 1% is awesome. We got a lot of replies to last month's post about spot instances. We use spot instances in our staging environment. The problem we have with spot instances in production comes from our workers. We run billions of jobs with Sidekiq and we don't want jobs to get interrupted if we can avoid it. We're looking at options to shut down workers gracefully in response to spot interruption notices, and we could run our web workloads on spot instances. However, these are optimizations we haven't made yet.
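Our workers run on Sidekiq (Ruby), but the idea is easy to sketch: EC2 publishes a two-minute warning through instance metadata before reclaiming a spot instance, and a worker can check for it between jobs and stop accepting new work. A rough Python sketch, assuming the IMDSv1 endpoint; the function names here are illustrative, not part of any library:

```python
import json
import urllib.request

# EC2 announces a spot reclaim about two minutes ahead of time at this
# instance-metadata path (IMDSv1 shown for brevity).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def should_drain(instance_action_json: str) -> bool:
    """Return True when the instance-action document announces a stop/terminate."""
    doc = json.loads(instance_action_json)
    return doc.get("action") in ("stop", "terminate")

def check_for_interruption(url: str = SPOT_ACTION_URL) -> bool:
    """Poll the metadata endpoint; a 404 or timeout means no reclaim is scheduled."""
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return should_drain(resp.read().decode())
    except Exception:  # 404 / unreachable: no interruption notice pending
        return False

# A worker loop would call check_for_interruption() between jobs and, on True,
# quiesce (finish the current job, pull no new ones) before the deadline.
```

The same pattern works for web instances behind a load balancer: on notice, deregister from the target group and let in-flight requests finish.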

Service breakdown

USE2-HeavyUsage:i3.2xlarge - $6,471.36 ( -3% )

These are our Cassandra and Elasticsearch clusters.

We use Cassandra to store massive amounts of data.

We use Elasticsearch to search through massive amounts of data.

We'll pay more in October because we bought more reservations at the end of September and beginning of October.

USE2-BoxUsage:c5.2xlarge - $3,309.63 ( +82% )

Our use of these has increased since the beginning of September.

Many of these are web instances and we could swap them out for spot instances.

To decrease this bill we need to use spot instances or reserve more instances.

USE2-HeavyUsage:c5.2xlarge - $1,866.24 ( +23% )

These are the 12 c5.2xlarge instances that we reserved last month.

This is the smallest number of instances we need running to serve our web traffic.

Between our c5.2xlarge on-demand and reservation bills, there is a lot of room to improve and save costs.

USE2-BoxUsage:t3.medium - $1,465.55 ( +13% )

We use these instances for workers that experience bursts of use throughout the day.

Containers will allow us to merge these workers into machines with shared resources. Doing so should reduce this bill.

USE2-DataTransfer-Out-Bytes - $1,438.83 ( +9% )

This is the cost of our apps communicating with each other as well as with the internet.

This is also affected by our backup strategies. We're running normal backups as well as disaster recovery backups.

It's also affected by our logging patterns. We send our logs out to a third-party service, so our egress costs increase as the volume of logs grows.

The secondary Cassandra clusters will cause this bill to increase.

USE2-BoxUsage:t3.xlarge - $1,180.55 ( +6% )

We started using more of these in mid-September.

We can break these workers apart and consolidate the work onto other machines. When we do that, we should see a decrease in this cost.
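To get a feel for the gap between those two c5.2xlarge line items, the reserved HeavyUsage charge can be turned into an effective per-instance hourly rate. A minimal sketch, assuming a 720-hour month and that all 12 reserved instances ran the whole month:

```python
def effective_hourly_rate(monthly_bill: float, instances: int, hours: int = 720) -> float:
    """Back out the per-instance hourly rate implied by a monthly line item."""
    return monthly_bill / (instances * hours)

# The $1,866.24 HeavyUsage line across 12 reserved c5.2xlarge instances
# works out to roughly $0.216/hour per instance.
rate = effective_hourly_rate(1866.24, 12)
```

Running the same calculation against the on-demand BoxUsage line (once you know the instance-hours behind it) shows how much of that spend a reservation or spot strategy could capture.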

Relational Database Service (RDS) - $18,780.04 ( -1% )

Our RDS bill is pretty stable. We have no plans to make any significant changes to this service, so we can expect the cost to remain mostly flat. There was a misconfiguration at the beginning of September: our backups didn't run for a few days, so we weren't charged for those days. We can expect a higher bill in October because we fixed our backups.

Service breakdown

USE2-HeavyUsage:db.r5.12xl - $4,790.02 ( -3% )

This is our MySQL master database.

We have it reserved so we can expect to pay this amount every month for the next year.

It's down because September is shorter than August.

RDS:ChargedBackupUsage - $2,804.17 ( -12% )

This is the cost of the backups we run for our disaster recovery account.

We had issues with backups not running at the end of August. We fixed the problem, but the fix wasn't reflected in the bill until a few days into September.

We can expect an increase in this bill for October because we fixed our backup problems.

USE2-InstanceUsage:db.r4.8xlarge - $2,764.80 ( -3% )

This is our on-demand MySQL replica.

We haven't committed to reserving it because we won't need it for very much longer.

As long as we keep this around, we can expect to pay about this much every month.

The cost is down because September is shorter than August.

USE2-RDS:ChargedBackupUsage - $2,616.73 ( -2% )

These are the normal, non-disaster-recovery backups.

These run every day and increase with database storage usage. The bigger our database, the more we can expect these backups to cost.

Our daily spend has been flat and should remain flat for the next few months. We have about 850GB of free storage left in our database and we consume about 200GB of storage per month. This cost should hold pretty steady until Q1 2020, when we need to add more storage.

USE2-HeavyUsage:db.r4.8xlarge - $1,596.67 ( -2% )

This is the reserve price for one MySQL replica.

We reserved this in July so we'll continue to pay this much for the next year.

It's down because September is shorter than August.

USE2-RDS:Multi-AZ-GP2-Storage - $1,691.64 ( +20% )

This is the cost for us to have storage available for our database in multiple Availability Zones.

Multiple Availability Zones are useful in case there is a service outage in a single Availability Zone in a Region.

Our database has 5400GB of storage capacity on GP2 drives. That storage space is available in three different Zones in our Region.

The increase is due to extra storage added at the beginning of September.

USE2-RDS:GP2-Storage - $1,269.60 ( +22% )

We increased our storage in September to accommodate growth.
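The Q1 2020 storage estimate mentioned for the backup line item is simple arithmetic, using the ~850GB free and ~200GB/month growth figures quoted above:

```python
def months_of_runway(free_gb: float, growth_gb_per_month: float) -> float:
    """How many months until free database storage runs out at the current growth rate."""
    return free_gb / growth_gb_per_month

# ~850GB free, consuming ~200GB/month: 4.25 months of runway.
# Counting forward from October 2019, that lands in Q1 2020.
runway = months_of_runway(850, 200)
```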

S3 - $9,951.83 ( -10% )

We began seeing cost savings for our data transfer out of S3 after migrating S3 and CloudFront requests to Cloudflare. This already had a big impact on our September bill and we'll continue to see further wins in our October bill.

Service breakdown

USE2-DataTransfer-Out-Bytes - $4,235.06 ( -15% )

We started serving our new uploads from a Cloudflare CDN at the end of September.

The CDN reduced our transfer out from S3 almost immediately.

DataTransfer-Out-Bytes - $3,607.07 ( -9% )

We moved this legacy bucket behind Cloudflare earlier but we took more steps in September to cache assets more aggressively.

Using a more aggressive cache on legacy assets has started driving down our spend.

USE2-TimedStorage-ByteHrs - $1,342.09 ( +5% )

The increase comes from fixing the bug that disrupted our backup schedules.

We can expect this bill to remain mostly flat for the foreseeable future.
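The "more aggressive cache" amounts to long-lived Cache-Control headers so Cloudflare can serve assets from its edge instead of pulling from S3 on every request. A hypothetical sketch of such a policy — the `uploads/` prefix and the TTLs are illustrative assumptions, not our actual configuration:

```python
# One year in seconds; safe for assets that never change after upload.
LONG_TTL = 60 * 60 * 24 * 365

def cache_control_for(key: str) -> str:
    """Choose a Cache-Control header for an S3 object served through a CDN."""
    if key.startswith("uploads/"):
        # Uploaded attachments don't change once written, so let the CDN
        # keep them for a year and skip revalidation entirely.
        return f"public, max-age={LONG_TTL}, immutable"
    # Everything else gets a short TTL so changes still propagate.
    return "public, max-age=3600"
```

Every CDN cache hit under this policy is a request that never reaches S3, which is exactly where the DataTransfer-Out savings come from.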

EC2-Other - $5,395.91 ( +7% )

We started logging more in September. As we log more we will be sending more data to our third-party logging provider. Some of that data will have to go through the NAT gateway. We are working on this by moving our log management in-house with the ELK stack.

We've gotten feedback about our NAT gateway from our last post. Our NAT gateway has IP addresses associated with it that are years old. Those IPs are used to help determine our reputation with email providers. We can migrate those IPs to new infrastructure to reduce bandwidth costs but we have to be careful doing so. We have to ensure that our IPs remain intact and reputable. This is a problem that we'll solve in the future but it hasn't been a high enough priority to warrant action.

Service breakdown

USE2-NatGateway-Bytes - $1,706.51 ( +20% )

Our backend services use the NAT gateway to communicate with the internet.

We have a third-party logging provider so all logs go through the NAT gateway and we've been logging more.

Once we move away from the third party, we should see some decrease in this bill.

USE2-EBS:VolumeUsage.gp2 - $1,425.51 ( +1% )

EBS volume costs correlate with the number of EC2 instances we're running.

We added the new clusters mentioned above, which accounts for the increase in cost.

We can expect to pay more in October.

USE2-DataTransfer-Regional-Bytes - $1,386.04 ( -2% )

We started replicating more data between us-east-2 and us-east-1, which led to an increase in our daily spend. We can expect this cost to increase in the future.

The cost is lower because September is a shorter month.
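For scale, the NatGateway-Bytes line can be converted into a rough estimate of data processed, assuming the published us-east-2 rate of $0.045/GB (the per-hour gateway charge is billed under a separate usage type):

```python
def nat_gb_processed(bytes_bill: float, per_gb: float = 0.045) -> float:
    """Estimate GB processed from a NatGateway-Bytes line item.

    $0.045/GB is AWS's published us-east-2 data-processing rate at the time
    of this bill; adjust per_gb if the rate changes.
    """
    return bytes_bill / per_gb

# $1,706.51 at $0.045/GB is roughly 37,900 GB (~37 TB) through the gateway,
# most of it log traffic headed to the third-party provider.
gb = nat_gb_processed(1706.51)
```

That number is why moving log management in-house with the ELK stack matters: logs that stay inside the VPC never cross the NAT gateway.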

Support - $4,393.24 ( 0% )

Support has tiered pricing. We pay for business support for both our production account and billing account. Business support on our billing account may be unnecessary because we use our production account for almost everything.

Service breakdown

7% of monthly AWS usage from $10K-$80K - $2,780.66 ( +2% )

This is the cost of only our production account.

10% of monthly AWS usage for the first $0-$10K - $1,612.58 ( -2% )

This is the cost of our production account and billing account.

We could save money by turning off support for our billing account.
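Business support is billed per account against AWS's published tier schedule: 10% of the first $10K of monthly usage, 7% from $10K-$80K, 5% from $80K-$250K, 3% above that, with a $100 monthly minimum. A sketch of that rate card — this models the published schedule, not a reconstruction of our exact invoices, which split usage across two accounts:

```python
# (tier upper bound, rate) pairs from AWS's published Business support pricing.
TIERS = [(10_000, 0.10), (80_000, 0.07), (250_000, 0.05), (float("inf"), 0.03)]

def business_support_cost(monthly_usage: float) -> float:
    """Compute the Business support charge for one account's monthly usage."""
    cost, lower = 0.0, 0.0
    for upper, rate in TIERS:
        if monthly_usage > lower:
            # Charge this tier's rate on the slice of usage that falls in it.
            cost += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(cost, 100.0)  # Business support has a $100 monthly minimum
```

Because the schedule is per account, a low-usage billing account mostly pays the $100 minimum plus the 10% tier, which is the savings opportunity mentioned above.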

Others - $6,010.16 ( +2% )

These are the services that make up the rest of the bill. In this group of services, the two biggest are CloudFront and EC2-ELB. After these two services, everything else gets much cheaper.

Service breakdown

CloudFront - $2,160.87 ( -2% )

This is the cost of our attachments and uploads that get delivered to a subscriber's inbox.

The more emails we send, the higher this bill will be.

We migrated to Cloudflare at the end of September so this bill should be pretty much gone in October.

EC2-ELB - $1,360.29 ( 0% )

We have load balancers sitting in front of our web workloads. As we handle more requests, we'll get billed more due to bandwidth.

Despite having different daily spends and different usage type breakdowns, August and September billed the exact same amount and that's pretty neat.

Conclusion

Containers and orchestration are the next big step for ConvertKit infrastructure. We migrated to Docker in production at the beginning of October and we're on the path to Kubernetes. Sharing cloud resources among our applications will allow us to cut down the amount of idle CPU and RAM. It will also make our infrastructure much more elastic. EC2 spend is one of the next bill optimizations that we'll need to make. We can begin to solve it with containers, Kubernetes, and the right combination of reserved and spot instances.

Looking back at the progress we've made in the last 6 months keeps me excited for where we will be and the work we'll get to do in the next 6 months. We have worked hard to get the bill to a more predictable state. We will continue to optimize our current infrastructure while still building for the future.

