As you may know from our previous posts, we started using Amazon Web Services heavily about nine months ago. It has been a great learning experience for all of us so far, and it is especially satisfying when all that hard work comes together exactly when it is most needed.

Today I want to talk about how AWS saved the day for us, in ways that wouldn’t have been possible with our previous “on-premises” setup.

Context

At Domain, one of our most important assets is images. When people look for properties, they want to see what they look like, and for real estate agents, the bigger and higher quality the photos we display, the better.

Recently we rebuilt our image processing infrastructure to work in the cloud, following the microservices strategy that has worked so well for us over the last few months. Initially there was not much to the service: it receives the image binary and some metadata, and saves them to Amazon S3. This is what it looks like in a nutshell:

So at the time an image is uploaded, we resize it into a few different sizes (the most commonly used ones) and store those to be served by our apps. This is nice because when an image is displayed, we don’t have any computation to do. Just request your image, and there is a good chance it will be in our CDN’s cache anyway.
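As a rough sketch of that upload-time step (the widths and code below are purely illustrative, not the actual service implementation), computing the set of pre-rendered dimensions might look like:

```python
def resize_targets(orig_w, orig_h, widths=(320, 640, 1024, 2048)):
    """Return (width, height) pairs for each pre-rendered size,
    preserving the original aspect ratio and never upscaling."""
    return [(w, round(orig_h * w / orig_w)) for w in widths if w <= orig_w]

# A 2048x1536 upload gets rendered once at each common width:
print(resize_targets(2048, 1536))
# -> [(320, 240), (640, 480), (1024, 768), (2048, 1536)]
```

Doing this work once at upload time is what keeps the read path computation-free.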

However, we still want to serve the best images to our users, so we needed some changes. On mobile devices – especially Android, due to the number of different devices and screen resolutions out there – the ideal image size for a given spot varies greatly. So what our developers came up with was to pass the size of the image container to the image service as a parameter, and have the service issue a redirect to the most appropriate image size. This requires a certain amount of processing, which was new for this service – previously we only ever uploaded to it, never read from it. But surely we could manage.
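The selection logic can be sketched as follows (hypothetical sizes and URL scheme, not the actual service code): given the container width, redirect to the smallest pre-rendered size that is at least as large, falling back to the largest one available.

```python
AVAILABLE_WIDTHS = (320, 640, 1024, 2048)  # illustrative pre-rendered sizes

def best_width(container_w, available=AVAILABLE_WIDTHS):
    """Smallest pre-rendered width >= the container, else the largest we have."""
    candidates = [w for w in sorted(available) if w >= container_w]
    return candidates[0] if candidates else max(available)

def redirect_url(image_id, container_w):
    # Hypothetical URL scheme; the service answers with a redirect to this path.
    return f"/images/{image_id}_{best_width(container_w)}.jpg"

print(best_width(700))              # -> 1024
print(redirect_url("abc123", 700))  # -> /images/abc123_1024.jpg
```

Each redirect involves a metadata lookup to find which sizes actually exist for the image, which is where the new read load on the service comes from.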

So once this was implemented, we pushed the changes live and got the Android team to enable the feature for 10% of users. We’ll talk about this in a future post, but feature flagging/toggling is really useful. If you are not using this concept extensively yet, think about it. It’s worth it!

“Maybe your best course would be to tread lightly”

A few days later, everything was behaving really nicely, so our Android team decided to crank it up to 11 and enable the feature for 100% of their users.

By now you probably know where this is going. People started updating their apps, and by the evening, when peak time hit, we started having issues. Images were loading on mobile devices, but very, very slowly.

Looking at New Relic (our monitoring tool of choice) we could see that response times were very inconsistent across requests – but then again, maybe it was the writes causing that. Reads should generally take around 10ms, while writes (uploads) can take up to 300ms. This time we were seeing some requests take up to a few seconds.

It didn’t take long for our users to let us know either. The thing with products today is that if you have an issue, the app stores and social media provide an immediate feedback channel, which can impact ratings. Bad reviews and ratings started coming in on the Play Store.

“Yeah, science!”

After a bit of digging around, we found that we had a problem with our data store. We use Amazon’s DynamoDB behind the scenes to store image metadata and the available sizes. It’s nice in that you just provision some capacity (reads and writes per second) and then pay for that throughput. But even though this product can in theory scale up and out to infinity and beyond, there is apparently no built-in way of provisioning more capacity automatically based on load.

Looking at the DynamoDB dashboard, it was quite evident that our requests were being throttled. Not to worry – we’re in the cloud. We just had to change our provisioned capacity (we quadrupled it), and a few minutes later we were back in business.
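To illustrate why the throttling happens (a toy model, not DynamoDB’s actual implementation): with provisioned throughput, any requests beyond the per-second capacity are rejected, so once peak read traffic exceeded what we had provisioned, the overflow was throttled – and quadrupling capacity made it disappear.

```python
def serve_second(provisioned_capacity, incoming_requests):
    """Toy model of provisioned throughput: requests beyond the
    per-second capacity get throttled rather than served."""
    served = min(incoming_requests, provisioned_capacity)
    throttled = incoming_requests - served
    return served, throttled

# Illustrative numbers: peak read load against the original capacity...
print(serve_second(100, 250))  # -> (100, 150): 150 reads/s throttled
# ...and the same load after quadrupling the provisioned capacity.
print(serve_second(400, 250))  # -> (250, 0): no throttling
```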

The following day we decided to solve the problem once and for all. Adding capacity to the data store had helped, but we were still seeing some inconsistencies, especially when the web cluster hosting the image service was scaling up and down.

The progress we were able to make in one day was quite impressive, and was only possible due to the power of cloud computing.

Here is what we ended up doing IN ONE DAY:

Scaled up the size of our servers – specifically, we changed our EC2 instance types from m3.medium to c4.xlarge. This means that our image service cluster scales less often, and also that we still have a decent amount of time to scale up when the cluster reaches 75% CPU.

Raised the minimum number of servers in the cluster, again to minimise scale-up events until we found out where our limits were. We can always scale back down later (and we did).

Separated the reads and writes into two different auto-scaling groups. We were able to spin up a secondary cluster hosting the same image service in about 10 minutes, thanks to the hard work that went into our Robot Army v2. All it took was five lines of configuration in a JSON file and waiting for the machines to boot up. Then we had to update our uploading apps to use the new cluster, but thanks to Octopus Deploy, this was just a matter of changing a couple of variables and triggering a re-deployment.
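The post doesn’t show the actual Robot Army v2 configuration, but to give a feel for how little it takes, a five-line cluster definition could plausibly look something like this (every field name here is invented for illustration):

```json
{
  "clusterName": "image-service-reads",
  "instanceType": "c4.xlarge",
  "minInstances": 4,
  "maxInstances": 12,
  "scaleUpCpuPercent": 75
}
```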

Our devs contributed four or five performance improvements to the application throughout the course of the day, in production. The fact that we can deploy and test our changes very quickly in our staging environment meant that we were able to confidently push those improvements to production throughout the day, without taking up any of our developers’ or devops engineers’ time for deployments.

Using JMeter, we load-tested the service with traffic similar to production (now that we knew what production load looked like) between every change, to make sure there was at least no regression.

This is what our setup looks like after these improvements:

So with the power of the cloud and automation, we were able to transform a struggling service into a very performant and scalable one in a single day. If this service were still in our data centre, it definitely wouldn’t have been possible in such a short timeframe. Sure, we were using virtualisation and could still spin up new VMs, but it would have taken at least a few more days to sort out. Also, scaling databases up or out is always much harder than just spinning up a new machine, so it would have involved data migrations and possibly downtime.

Anyway, this is how our image service (for reads) has been performing over the last 3 days:

So that’s a very consistent 10-12ms average response time, even though the throughput can vary significantly:

“Apply yourself and respect the chemistry”

This is all well and good, and our service was performing as it should again in less than 24 hours. This incident definitely confirmed to us that our cloud and microservices strategy is what we need.

We could also have done much better and avoided all of this with better planning and a few extra checks. So, what have we learnt?