Building Highly Secure Infrastructure

It starts with designing the networking for one's infrastructure. The crucial part is designing the VPC and subnets. Here we have used the DMZ security pattern.

DMZ Network design pattern

One public subnet hosts the bastion host, and everything else, i.e. the application and database instances, sits in private subnets. The bastion host is accessible via SSH from the internet, and from the bastion host you can connect to all other instances in the private network. Refer to this blog for the bastion setup.
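The hop through the bastion can be scripted in `~/.ssh/config` so that private instances are reachable in one command; a minimal sketch, where the hostnames, IP, and key paths are illustrative:

```
# ~/.ssh/config -- hostnames, IP, and key paths below are illustrative
Host bastion
    HostName 203.0.113.10          # public IP of the bastion host
    User ec2-user
    IdentityFile ~/.ssh/bastion.pem

Host app-*
    # private-subnet instances are reachable only through the bastion
    ProxyJump bastion
    User ec2-user
    IdentityFile ~/.ssh/app.pem
```

With this in place, `ssh app-1` transparently tunnels through the bastion host.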

High Availability — Application as well as Database

Every AWS region has two or more availability zones. One can leverage this to make services and databases highly available. For every service, create subnets in each availability zone of the region.

For example, the ap-south-1 region has two availability zones (ap-south-1a and ap-south-1b). For every application and database replica, create subnets in both availability zones. Make sure each service has an instance in every availability zone. This is where the ECS placement strategy comes into the picture. The following strategy setting is used to achieve high availability:

spread(attribute:ecs.availability-zone), spread(instanceId)

This ensures that tasks are spread across availability zones first and then across instances, so copies of the same service are distributed rather than stacked on a single instance.
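In Terraform, the same placement strategy can be attached to an ECS service; a sketch, where the service, cluster, and task definition names are illustrative:

```hcl
resource "aws_ecs_service" "app" {
  name            = "app"                           # illustrative
  cluster         = aws_ecs_cluster.main.id         # assumed to exist
  task_definition = aws_ecs_task_definition.app.arn # assumed to exist
  desired_count   = 2

  # Spread tasks across availability zones first...
  ordered_placement_strategy {
    type  = "spread"
    field = "attribute:ecs.availability-zone"
  }

  # ...then across container instances within each zone.
  ordered_placement_strategy {
    type  = "spread"
    field = "instanceId"
  }
}
```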

We use MongoDB as the database. The MongoDB replica set consists of a primary, a secondary, and an arbiter, and these instances fall under different availability zones in the region.

This lets the deployment survive the complete outage of a single availability zone within a region.
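A replica set of this shape is initiated once from the mongo shell; a sketch of the replica-set configuration, with illustrative internal hostnames, one per availability zone:

```javascript
// Run once against the intended primary; hostnames are illustrative
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-1a.internal:27017" },   // ap-south-1a, primary
    { _id: 1, host: "mongo-1b.internal:27017" },   // ap-south-1b, secondary
    { _id: 2, host: "arbiter.internal:27017", arbiterOnly: true }
  ]
})
```

The arbiter holds no data; it only votes in elections, which keeps the replica set quorum intact when one zone goes down.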

Hassle-free Infra Mirroring Across Regions

We use Terraform for managing the Infrastructure. Following things are provisioned using Terraform for Workbench.

VPC

Subnets

ALB

NAT Gateway (single public IP for the external world)

Internet Gateway

Instance for Bastion Host

ECS Clusters

MongoDB database instances
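The networking pieces in the list above translate fairly directly into Terraform resources; a condensed sketch, with illustrative CIDR ranges and resource names:

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # illustrative
}

# One public subnet for the bastion host
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.0.0/24"
  availability_zone       = "ap-south-1a"
  map_public_ip_on_launch = true
}

# Private subnets per availability zone for apps and databases
resource "aws_subnet" "private" {
  for_each          = { "ap-south-1a" = "10.0.1.0/24", "ap-south-1b" = "10.0.2.0/24" }
  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value
  availability_zone = each.key
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

# NAT gateway in the public subnet gives private instances a single egress IP
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}
```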

We followed a modular approach when writing the Terraform code: we built modules for every AWS resource we use and reused these modules to create the infrastructure. It is a layered architecture consisting of three layers:

Base layer: Modules for every resource

Infrastructure layer: Actual infrastructure code that provisions the infra by reusing the base modules

Configuration layer: Environment-wise or region-wise configuration used to provision the infra. We have three environments: staging, pre-production, and production. For production, you can have region-wise configuration as well.

By following this design, all you have to do is create the configuration for a new environment or region and run Terraform with that configuration, and your infra is ready. This layered design allows us to mirror the complete infrastructure to other regions with minimal effort.
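The three layers map naturally onto a directory layout; an illustrative sketch (module and file names are assumptions, not the actual repository structure):

```
terraform/
├── modules/             # Base layer: one module per AWS resource type
│   ├── vpc/
│   ├── subnet/
│   └── ecs-cluster/
├── infrastructure/      # Infrastructure layer: composes the base modules
│   └── main.tf
└── config/              # Configuration layer: per-environment/region variables
    ├── staging.tfvars
    ├── pre-production.tfvars
    └── production-ap-south-1.tfvars
```

Mirroring to a new region then comes down to adding one tfvars file and running, e.g., `terraform apply -var-file=config/production-ap-south-1.tfvars`.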

Monitor & Measure: Infra Toolset at Fynd

We need tools to monitor and measure the infrastructure and applications. There are plenty of open-source and paid tools on the market for the job, but choosing the right tool for each purpose takes considerable effort (benchmarking, trials, and so on). After filtering out the rest, these are the tools we decided to use:

System Monitoring

System Metrics — New Relic Infrastructure

New Relic Infrastructure: Used for monitoring CPU utilization, server load, memory utilization, disk space usage, etc.

Application Monitoring

Application Metrics — New Relic APM

New Relic APM: To monitor the application's performance: response time, external services' response times, database response time, etc.

New Relic Synthetics: For service uptime monitoring.

Prometheus: For code instrumentation and to generate custom metrics.
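Custom metrics of this kind are typically emitted with the official Prometheus client library; a minimal Python sketch, where the metric and endpoint names are illustrative rather than the actual ones used:

```python
from prometheus_client import Counter, Histogram, generate_latest

# Illustrative application metrics
REQUESTS = Counter("app_requests", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

@LATENCY.time()  # records each call's duration into the histogram
def handle_request(endpoint):
    REQUESTS.labels(endpoint=endpoint).inc()
    # ... application logic ...

handle_request("/health")

# generate_latest() renders all registered metrics in the Prometheus text format,
# which a Prometheus server scrapes from an HTTP endpoint
print(generate_latest().decode())
```

In a real service these metrics would be exposed on a `/metrics` endpoint for Prometheus to scrape, rather than printed.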

Sentry: For exceptions notification in the code.

Self Healing Mechanism

Standalone Services: We use Monit for auto recovery of databases and applications deployed on EC2 instances if they go down.

For example, we configured Monit to monitor the MongoDB PID file and specified the commands to start/stop/restart MongoDB. When MongoDB goes down for any reason, Monit no longer finds the PID file it was monitoring, recognizes that the service is down, and starts the service without any manual intervention.
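A Monit stanza for this looks roughly as follows; the PID file path and start/stop commands are illustrative and depend on how MongoDB is installed:

```
# /etc/monit.d/mongod -- paths and commands are illustrative
check process mongod with pidfile /var/run/mongodb/mongod.pid
    start program = "/usr/bin/systemctl start mongod"
    stop program  = "/usr/bin/systemctl stop mongod"
    # restart if the process dies or stops answering on its port
    if failed port 27017 then restart
    if 5 restarts within 5 cycles then alert
```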

ECS Services: ECS has out-of-the-box auto-recovery support for applications. It keeps monitoring the application's health check endpoint, and if the endpoint does not respond within a predefined time, ECS kills the task and spins up a new one within about 60 seconds.
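The health check driving this recovery is typically defined on the ALB target group in front of the service; a Terraform sketch with illustrative endpoint and thresholds:

```hcl
resource "aws_lb_target_group" "app" {
  name     = "app"             # illustrative
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id   # assumed to exist

  health_check {
    path                = "/health" # illustrative health check endpoint
    interval            = 30        # seconds between checks
    timeout             = 5         # seconds before a check counts as failed
    healthy_threshold   = 2         # consecutive passes to mark healthy
    unhealthy_threshold = 3         # consecutive failures to mark unhealthy
  }
}
```

When the target group marks a task unhealthy, ECS deregisters it and the service scheduler replaces it to restore the desired count.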

Dashboards

Grafana dashboard

Prometheus collects the services' metrics and is integrated with Grafana for dashboarding.

Alerts

PagerDuty, Slack, and email for critical alerts and warnings.