Over the last couple of years we have started moving many of our workloads onto the AWS cloud, primarily the new applications, websites and micro-services we were building. The unfortunate reality was that the bulk of Domain, including our consumer website, the backend to our mobile apps, our primary database and many other core apps, was still running in our data center.

As we gained experience with AWS we realised there would be major benefits to shifting the remainder of Domain’s infrastructure to the cloud. So, while continuing to break down our application into micro-services where possible, we kicked off a piece of work to “lift and shift” Domain to AWS. This was a massive undertaking on the team and everyone was up to the challenge.

The Dinosaur

Domain’s foray into the cloud began when the team started to build out micro-services to create a smarter and more flexible architecture. This approach has been successful, as those systems have proved to be scalable, performant and reliable. However, continuing down that path alone would have taken at least a couple of years to complete the migration.

A large subset of our applications was sitting on ageing hardware. This included our 250 GB SQL database, our mobile web services and web infrastructure, and a huge fleet of offline processing tasks. The on-premise infrastructure was due for a hardware refresh this year, and instead of investing heavily in new hardware we decided to focus our efforts on moving everything into the cloud. Wasting manpower changing tapes and replacing failed disks in data centers is now an outdated concept.

Due to the age of the hardware and the use of legacy software (IIS 6) we were experiencing high error rates and high response times during peak periods. This drove up the number of after-hours calls received by the team, which was now averaging 45 a month. New Relic (one of our primary monitoring systems) showed the pre-migration response times for our mobile web services and desktop environment hovering around 150ms and 210ms respectively during our nightly peak.

On premise mobile web services response time

On premise desktop response time

Migration Challenges

Some challenges we were expecting to face during the migration included:

- Not knowing where the dependencies with other systems were. Domain had grown organically within Fairfax and over time many shortcuts were taken to gain access to various data around the network.
- Decoupling our shared database infrastructure from the central Fairfax team.
- Wondering whether it would even be possible to virtualise our current database environment.

We needed to come up with a design to ensure we could run this large SQL workload in the cloud. This was the most critical part of the migration – it would determine whether the cloud migration moved forward or not.

In order to address the issues and challenges mentioned above we had to define the strategy, plan the process, build the infrastructure, test, and cut over the application without causing major delays or outages to domain.com.au. The whole process needed to be seamless.

Planning

We started by assembling a dedicated team of developers, devops and database engineers who could focus 100% of their time on this project. There were roughly six engineers in total working on the project throughout.

An offsite planning session allowed us to further break down the tasks using tools like Smartsheet, Jira and Lucidchart.

The AWS migration scrum board was born. We used Jira to track our weekly work and this allowed us to set achievable goals, track progress, and estimate the completion date.

The final step of the planning phase was to engage our AWS Enterprise and Technical account managers early on in the project. Early engagement meant a quicker response from AWS when we hit roadblocks during the migration as well as during the critical stages of the project such as “Go Live” dates. Having this support ensured a seamless transition process during the cut over phase from on premise to AWS.

The Solution

Now that we had everyone on board we were ready to start. From our previous experience we already had the foundation to build out the infrastructure in place. But there was still plenty to learn, such as how to run an FTP server on AWS and how to do canary deployments onto a dynamic set of servers.

The first step was to build a virtualised database (MS SQL) environment to replicate the on-premise environment. We built the environment across 2 availability zones, then set up an array of machines to mimic production traffic. This included scripting the creation of 16 replay clients we could use for load testing. This allowed us to play back a production workload, and meant we could be confident in the infrastructure. We knew if CPU reached 70% or above it was “game over”. During peak loads our on-premise DB was hovering around 25%. We needed to achieve somewhere between 30 – 40% CPU to ensure site performance was not compromised. After weeks of replaying production load and tweaking our configuration we hit the target figure. All results were positive and we knew the answer to our earlier question: yes, we could run a virtual instance of the Domain DB in the cloud.
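The post doesn’t say which tooling was used to script the replay fleet, but the idea of launching a batch of identical replay clients in one call can be sketched with a boto3-style parameter builder (the AMI ID, instance type and tag values below are hypothetical placeholders):

```python
# Sketch only: building the parameters for a single EC2 RunInstances call
# that launches `count` identical SQL replay clients. The AMI ID, instance
# type and tags are made up for illustration.

def build_replay_fleet_params(count=16, ami="ami-xxxxxxxx",
                              instance_type="c4.xlarge"):
    """Return the parameter dict for one RunInstances call that
    launches `count` identical replay-client instances."""
    return {
        "ImageId": ami,
        "InstanceType": instance_type,
        "MinCount": count,   # all-or-nothing: launch the full fleet
        "MaxCount": count,
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "Role", "Value": "sql-replay-client"}],
        }],
    }

# With AWS credentials configured, the fleet would be launched with:
#   boto3.client("ec2").run_instances(**build_replay_fleet_params())
```

Launching the fleet as one request (rather than 16 separate calls) keeps the replay clients identical and makes tearing them down after each load-test run a single tag-filtered terminate call.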

The table below shows a comparison of the database infrastructure between on-premise and AWS. We were able to drop the number of cores from 32 to 16 and still keep CPU at around 30% during peak load. This also meant a reduction in our Microsoft SQL 2014 licence cost. We no longer had to pay for 32 cores when we could power Domain in the cloud at a fraction of the licence cost.

Comparison Table

| Measure | On Premise | AWS |
| --- | --- | --- |
| CPU count | 32 | 16 |
| Memory | 64 GB | 90 GB |
| Average CPU (peak period) | 25% | 30% |

Gaining confidence with our database environment was a major milestone in our cloud migration. We were now feeling confident to flick the switch. When we cut over from our on premise SQL environment to AWS it was seamless and we recorded no blips/outages in our monitoring. Site Confidence (another monitoring system) below demonstrates the moment before midnight when we switched across without a single outage. In the words of our CTO (Mark Cohen) – “It’s like you did a heart transplant on someone between heartbeats and nobody noticed”.

To achieve this in a short period of time would have been near impossible in the past when we were still in the data center. Setting up 16 replay clients, building the infrastructure in record time and switching from physical to virtual would have come at a cost. What we achieved in weeks without causing an outage to the business was a major achievement.

The Final Challenge

Now, with the backend infrastructure in the cloud, the team set out to move the front end (desktop and mobile web services servers). We were able to run the DB on AWS while leaving the web servers on premise by using Direct Connect. This hybrid setup allowed us to connect directly to AWS through a private network connection without causing any major disruptions to our production environment. Our on-premise infrastructure was running Windows 2003 and IIS 6 for both web and mobile web services, which was considered legacy. Given the existing issues and the large volume of after-hours calls, the team felt like it was on a knife’s edge, wondering when the next outage would hit. On average we were taking about 12 – 14 calls a week. During peak, our server side response times were hovering around 150ms and 200ms for mobile web services and desktop respectively.

After building out the new website infrastructure – this time on Windows 2012 R2 Server Core and IIS 8.5 – we were ready to push live traffic to the AWS cluster. Rather than a big bang approach, the cut over was carried out using weighted DNS, which mitigated downtime and allowed for a quicker rollback. We first had AWS pre-warm our Elastic Load Balancers so that we could handle a sudden spike in traffic, then used weighted load-balancing DNS to shift traffic gradually. As we cut over both the mobile web services and website environments there was a significant drop in server side response times. The table below shows the gradual cut over between AWS and on-premise.

| On Premise Weight | AWS Weight | Split (%) |
| --- | --- | --- |
| 9 | 0 | 0% |
| 9 | 1 | 10% |
| 9 | 3 | 25% |
| 9 | 9 | 50% |
| 3 | 9 | 75% |
| 0 | 9 | 100% |

DNS weighting
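The weighted cut-over above maps directly onto Route 53 weighted record sets: each weight pair resolves to a traffic split of AWS ÷ (on-premise + AWS). A minimal sketch of both the split calculation and the Route 53 change batch follows (the record name, TTL and target hostnames are hypothetical; the post doesn’t name the DNS provider, so Route 53 is an assumption):

```python
# Sketch: weighted DNS cut-over. Record names and targets are placeholders.

def traffic_split(on_prem_weight, aws_weight):
    """Percentage of traffic sent to AWS for a given weight pair."""
    total = on_prem_weight + aws_weight
    return round(100 * aws_weight / total) if total else 0

def change_batch(record_name, on_prem_weight, aws_weight,
                 on_prem_target="origin.example.com",
                 aws_target="example-elb.ap-southeast-2.elb.amazonaws.com"):
    """Build a Route 53 ChangeBatch upserting both weighted records.
    Would be passed to route53.change_resource_record_sets(...)."""
    def record(set_id, weight, value):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,   # distinguishes the two weighted records
                "Weight": weight,
                "TTL": 60,                 # short TTL for a quick rollback
                "ResourceRecords": [{"Value": value}],
            },
        }
    return {"Changes": [
        record("on-premise", on_prem_weight, on_prem_target),
        record("aws", aws_weight, aws_target),
    ]}
```

Stepping the weights through (9, 0) → (9, 1) → (9, 3) → (9, 9) → (3, 9) → (0, 9) reproduces the 0/10/25/50/75/100% splits in the table, and the short TTL means any step can be reversed within a minute.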

The results were amazing. As we moved a greater percentage of traffic across to AWS, the server response time followed a downward trend, ending in a 50% decrease in server side response times. The graph below shows an increase in DB response time – this was expected behaviour since desktop was still on premise. It improved once we migrated the desktop environment into the cloud.

Server side response time

The graph below shows a comparison of server side response times dropping to around 50ms from the previous week where it was averaging 100ms.

How it looks today

The graphs below show the response times for both mobile web services and desktop during the peak period.

Mobile web services response time

Desktop response time

Due to the nature of our new scalable, self-healing infrastructure we have seen a dramatic decrease in the number of calls coming through to the on-call engineer. We have also been able to optimise our deployment process and reduce the time it takes to push a new version of domain.com.au live by 50%.

The investment in time for us to complete this migration has already started to pay dividends, especially when you start looking at it from beyond a cost perspective. At Domain the benefits we see right now are:

Fully automated environments that scale with our traffic patterns and heal themselves when an issue is found

A reduction in application error rates

Latest server and IIS version (Server 2012 R2 Core and IIS 8.5) – moving to this platform has provided greater stability in the infrastructure. Combined with the AWS infrastructure this has allowed us to:

- Reduce after-hours support calls by 83%
- Improve server side response time for desktop – 54% faster
- Improve server side response time for mobile web services – 55% faster



Latest version of SQL 2014 Enterprise – this has allowed us to use features like AlwaysOn Availability Groups, allowing us to offload read-only workloads to secondary replicas
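Offloading read-only workloads with AlwaysOn is driven from the client side: connections that declare read intent against the availability group listener get routed to a secondary replica. A minimal sketch of such a connection string (the listener and database names are hypothetical, and driver setup such as pyodbc is omitted):

```python
# Sketch: read-intent connection string for an AlwaysOn availability group.
# Listener and database names below are placeholders, not Domain's real ones.

def read_only_conn_str(listener, database):
    """SQL Server ODBC connection string. ApplicationIntent=ReadOnly asks the
    listener to route this connection to a readable secondary replica;
    MultiSubnetFailover speeds up reconnects when the AG fails over."""
    return (
        "Driver={ODBC Driver 17 for SQL Server};"
        f"Server=tcp:{listener},1433;Database={database};"
        "ApplicationIntent=ReadOnly;MultiSubnetFailover=Yes;"
        "Trusted_Connection=Yes;"
    )
```

Read-only routing also requires the availability group itself to have routing URLs and a routing list configured on the server side; the connection string alone only signals the intent.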

The ability to replicate production environments in UAT and staging easily.

A modern platform to innovate on

Most importantly… a happier devops team!

Overall this has been a successful migration as we have resolved our issues and overcome challenges. The speed with which this migration was completed is a testament to the AWS infrastructure and the talented technical team at Domain that made it happen.