Manually deploying this type of data engineering pipeline onto AWS (or any other cloud provider) is both difficult and tedious: you have to provision the servers, install and configure the software for each piece of the pipeline, and finally ensure that each component can communicate properly. Without automation, this process has to be repeated by hand every time the pipeline is deployed, with no guarantee of immutability. A much better approach is to automate and version control the deployment by writing the infrastructure as code (IaC). This concept fits within the DevOps mindset of treating operations as an engineering problem.

Automating the Process

The tool of choice for IaC was HashiCorp's Terraform since it has many AWS integrations and is one of the most popular open-source options with a lot of community support. Terraform provisions the infrastructure on AWS, but the servers still need to be configured. To automate this step, I chose Packer, which lets me create machine images with the software I need pre-installed. The Packer input files for this project can be found here. Terraform can then use these images when provisioning servers. For example, the following Terraform input creates the PostgreSQL server, where ${var.AMIS} refers to the Packer-generated PostgreSQL image:

resource "aws_instance" "postgresql" {
  ami                         = "${var.AMIS}"
  instance_type               = "m4.large"
  key_name                    = "${var.KEY_NAME}"
  count                       = 1
  vpc_security_group_ids      = ["${var.SECURITY_GROUP_ID}"]
  subnet_id                   = "${var.SUBNET}"
  associate_public_ip_address = true

  root_block_device {
    volume_size = 100
    volume_type = "standard"
  }

  tags {
    Name        = "postgres-${var.SUBNET_NUM}"
    Environment = "dev"
    Terraform   = "true"
  }
}
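For context, a Packer template that produces such a PostgreSQL image might look like the following minimal sketch. This is not the project's actual template; the region, source AMI, instance type, and install commands are illustrative placeholders:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "t2.medium",
    "ssh_username": "ubuntu",
    "ami_name": "postgresql-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get update",
      "sudo apt-get install -y postgresql"
    ]
  }]
}
```

Running packer build on a template like this outputs an AMI ID, which is then fed to Terraform as ${var.AMIS}.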

Terraform organizes the code for each component of the data engineering pipeline into a module, which increases code readability and reusability. A total of six modules were created for this project:

aws, which sets up AWS networking, e.g., security groups, the virtual private cloud, subnets, and route tables.

elb, which creates the AWS Elastic Load Balancer for distributing traffic across multiple availability zones (more on this later).

flask, which creates the AWS auto-scaling group for controlling the number of Flask servers based on user demand (more on this later).

postgres, which sets up the PostgreSQL servers.

prometheus, which creates the Prometheus server for monitoring system metrics (more on this later).

spark, which sets up the Spark cluster.

In short, each module is made up of the following:

main.tf, which defines the AWS resources to be created.

output.tf, which defines module output variables.

variables.tf, which defines module input variables.

Bash scripts for software configuration.

The simplest module is postgres where main.tf was shown earlier. The output.tf file shown below outputs the Postgresql server DNS which is used by the Flask servers to connect to the database:

output "PRIVATE_DNS" {
  value = "${aws_instance.postgresql.private_dns}"
}

The variables.tf file, shown below, reads in the variables required to provision the PostgreSQL server:

variable "AMIS" {}
variable "KEY_NAME" {}
variable "SECURITY_GROUP_ID" {}
variable "SUBNET" {}
variable "SUBNET_NUM" {}

I won’t go through each module in detail, as each one is unique and involved enough to warrant its own article; however, I encourage the reader to explore the modules that interest them. The coordination of all the modules to deploy the data engineering pipeline is handled by main.tf in the root Terraform directory. To create a truly one-click deployment, a script was written to build all of the Packer AMIs and then run Terraform to deploy onto AWS.
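To give a sense of how the root main.tf wires the modules together, a call to the postgres module might look like the following sketch. The module's input names match the variables.tf shown earlier, but the root-level variable and output names here are hypothetical:

```hcl
module "postgres" {
  source            = "./postgres"
  AMIS              = "${var.POSTGRES_AMI}"
  KEY_NAME          = "${var.KEY_NAME}"
  SECURITY_GROUP_ID = "${module.aws.SECURITY_GROUP_ID}"
  SUBNET            = "${module.aws.SUBNET}"
  SUBNET_NUM        = "1"
}
```

Other modules can then consume the module's PRIVATE_DNS output, e.g. "${module.postgres.PRIVATE_DNS}", to point the Flask servers at the database.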

Original infrastructure design to simply get things up and running

Improving Fault Tolerance

Once the deployment process was automated, I wanted to improve the fault tolerance of the pipeline. As it stood, if any service crashed or slowed down under load, the whole pipeline was affected. There was no auto-scaling for the Flask servers to handle changes in user demand, and everything was hosted in a single availability zone (AZ), which left the service prone to blackouts in the event of an outage. A more robust infrastructure design is shown below.

Revised infrastructure design to increase scalability and reliability

The first infrastructure change was to auto-scale Flask so that the number of Flask servers changes dynamically with user demand. I chose to create and destroy Flask servers based on CPU usage, which appeared to be the limiting resource when running the Flask application. The tools of choice were AWS auto-scaling groups (ASG) and AWS CloudWatch. The entire implementation can be found in the flask module.
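The flask module contains the actual implementation; a simplified sketch of CPU-based scale-up with an ASG, a scaling policy, and a CloudWatch alarm might look like the following. Resource names, sizes, and thresholds are illustrative, and a mirrored scale-down policy and low-CPU alarm (not shown) would handle the lower limit:

```hcl
resource "aws_autoscaling_group" "flask" {
  launch_configuration = "${aws_launch_configuration.flask.id}"
  min_size             = 1
  max_size             = 4
  vpc_zone_identifier  = ["${var.SUBNET_1}", "${var.SUBNET_2}"]
}

# Add one Flask server when triggered
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "flask-scale-up"
  autoscaling_group_name = "${aws_autoscaling_group.flask.name}"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

# Fire the scale-up policy when average CPU exceeds the upper limit
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "flask-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 60
  alarm_actions       = ["${aws_autoscaling_policy.scale_up.arn}"]

  dimensions {
    AutoScalingGroupName = "${aws_autoscaling_group.flask.name}"
  }
}
```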

The second infrastructure change was to deploy the pipeline across two AZs so that it can withstand a single AZ outage. This was accomplished by defining an AWS Elastic Load Balancer (ELB) to evenly distribute traffic across the two AZs. The Spark cluster is not duplicated since we are only running batch jobs, which do not require high availability. To demonstrate the increased scalability of the revised infrastructure, I used Locust to simulate a spike in user traffic and Prometheus to monitor the resulting CPU usage of all the Flask servers.
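An ELB spanning subnets in two AZs might be defined along these lines. This is a sketch, not the project's exact configuration: the name, health-check path, and the assumption that Flask listens on port 5000 are all placeholders:

```hcl
resource "aws_elb" "flask" {
  name                      = "flask-elb"
  subnets                   = ["${var.SUBNET_1}", "${var.SUBNET_2}"]
  security_groups           = ["${var.SECURITY_GROUP_ID}"]
  cross_zone_load_balancing = true

  # Forward HTTP traffic to the Flask servers
  listener {
    lb_port           = 80
    lb_protocol       = "http"
    instance_port     = 5000
    instance_protocol = "http"
  }

  # Remove unhealthy instances from rotation
  health_check {
    target              = "HTTP:5000/"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}
```

With cross_zone_load_balancing enabled, the ELB spreads requests evenly across instances in both AZs rather than only within each zone.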

CPU usage as a function of time following a spike in traffic simulated by Locust. Upper and lower CPU limits set by AWS CloudWatch to trigger AWS Auto-Scaling.

The above figure plots CPU usage for all Flask servers as a function of time across the two AZs. Locust is started around 20:41 and we immediately see a spike in CPU usage. The CPU usage surpasses the upper limit set by CloudWatch, which triggers the ASG to provision additional Flask servers until the CPU usage across all servers is below the upper limit. We see additional Flask servers begin to come online between 20:44 and 20:46, after which the CPU usage across all servers falls below the upper limit. Locust is turned off at 20:49, which causes the CPU usage across all Flask servers to drop below the lower limit. This triggers the ASG to scale the number of Flask servers back down to one. Next, I demonstrate the ability of the revised infrastructure to withstand the loss of a single AZ. I start Locust again to spike the traffic, then at the peak CPU level, I cut off access to one of the AZs through the ELB.

CPU usage as a function of time after a spike in traffic and loss of an availability zone

This time, Locust is started around 8:50 and we immediately see a spike in CPU usage. At the peak CPU level, I cut off access to the lower AZ through the ELB, which causes its CPU usage to drop to nearly zero (not exactly zero due to background processes). All traffic is then redirected to the upper AZ, causing its CPU usage to surpass the upper limit. As in the previous example, this triggers the ASG to increase the number of Flask servers until the CPU usage across all Flask servers is below the upper limit. At 9:05, Locust is turned off and we see the CPU usage drop below the lower limit, which triggers the ASG to scale the number of Flask servers back down to one.