A management prof I knew had a habit of using a semantic trick to extract a definition from any phrase — flip the words around and hope for a eureka moment. Let’s see if it works out.

The Well-Architected Framework is a framework to architect… well.

That was the first, and last, management class I took.

To go a bit deeper, let’s explore what the Well-Architected Framework isn’t. It’s not a step-by-step guide to becoming an AWS infrastructure guru. It’s also not a guide that tells cloud developers which cloud services they should be using. You won’t get implementation details or architectural patterns.

What you will get is a set of questions and practices organized into five pillars. They’re meant to be kept in mind while developing and architecting for AWS, and to serve as a benchmark against which you can evaluate your infrastructure. It is the AWS-prescribed way of checking that your cloud follows best practices.

Let’s go over the pillars, and preview a few of the practices found within each.

Security

The Security pillar encompasses the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.

AWS Security Pillar White Paper, May 2017

Pretty self-explanatory. You don’t want unauthorized users gaining access to your infrastructure. You don’t want to unwittingly leak thousands (millions?) of users’ data. You don’t want your company on the news, or to have to lobby Congress to pass a law that prevents consumers from suing you.

**cough** Equifax 2017 **cough**

Example best practices:

Identity and Access Management (IAM) — Do use an IAM user or an assumed role for daily operations. Don’t use the root account for them. If you really need to use the root account, always protect it with Multi-Factor Authentication (MFA).

Simple Storage Service (S3) — Do restrict access to your buckets to only the resources that need it. Don’t enable unrestricted access for the public or for any authenticated AWS user.
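To make the S3 practice a bit more concrete, here is a minimal sketch of an audit check. It assumes you already have a bucket ACL as a dictionary in the shape that boto3's `get_bucket_acl` returns (a `Grants` list with `Grantee` and `Permission` entries). The function name `find_broad_grants` and the sample ACL are made up for illustration, and a real audit would also need to look at bucket policies, which this sketch ignores.

```python
# URIs of the two predefined S3 groups that make a bucket broadly accessible.
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
BROAD_GROUPS = {ALL_USERS, AUTHENTICATED_USERS}


def find_broad_grants(acl):
    """Return (group_uri, permission) pairs for grants to everyone or to any AWS user."""
    findings = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") == "Group" and grantee.get("URI") in BROAD_GROUPS:
            findings.append((grantee["URI"], grant.get("Permission")))
    return findings


# Hypothetical ACL for illustration only; the owner ID is invented.
example_acl = {
    "Owner": {"ID": "owner-id-example"},
    "Grants": [
        {
            "Grantee": {"Type": "CanonicalUser", "ID": "owner-id-example"},
            "Permission": "FULL_CONTROL",
        },
        {
            "Grantee": {"Type": "Group", "URI": ALL_USERS},
            "Permission": "READ",
        },
    ],
}

for uri, permission in find_broad_grants(example_acl):
    print(f"Overly broad grant: {permission} to {uri}")
```

Keeping the check as a pure function over a dictionary means you can try it without AWS credentials, then feed it real ACLs later.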

Reliability

The Reliability pillar encompasses the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

AWS Reliability Pillar White Paper, November 2016

AWS is pretty reliable. 99.9%+ uptime is promised and delivered. However, for mission-critical resources, even a few hours of downtime a year can be devastating. Following the rules of this pillar will have you design your infrastructure to limit the impact that “Act of God” failures can have on your workloads and users, and to keep your apps available when individual components fail.
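It helps to turn that percentage into hours. The sketch below is plain arithmetic, not an AWS API: it converts an availability figure into the downtime it still permits over a 365-day year. The function name is my own. At 99.9%, the allowance works out to roughly 8.76 hours a year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a service can be down while still meeting the availability figure."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for availability in (99.9, 99.99):
    hours = downtime_hours_per_year(availability)
    print(f"{availability}% availability allows about {hours:.2f} hours of downtime per year")
```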

Example best practices:

Relational Database Service (RDS) — Do set up automated backups of your RDS instances. Don’t let admin errors or catastrophes affect your data availability.

Elastic Load Balancing (ELB) — Do register at least two EC2 instances with each load balancer. Don’t let downtime on a single EC2 instance affect user experience.

Elastic Compute Cloud (EC2) — Do spread your EC2 instances across multiple Availability Zones. Don’t put all your eggs in one basket, on the off chance AWS drops it. 🐣
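The Availability Zone practice is easy to check mechanically. Here is a rough sketch that assumes each instance is a dictionary with a `Placement.AvailabilityZone` field, which is how the EC2 DescribeInstances API reports it. The helper names and the sample fleet are hypothetical.

```python
from collections import Counter


def instances_per_zone(instances):
    """Count how many instances run in each Availability Zone."""
    return Counter(inst["Placement"]["AvailabilityZone"] for inst in instances)


def lacks_zone_redundancy(instances):
    """True when the instances occupy fewer than two Availability Zones."""
    return len(instances_per_zone(instances)) < 2


# Hypothetical fleet for illustration; the instance IDs are invented.
fleet = [
    {"InstanceId": "i-0aaa", "Placement": {"AvailabilityZone": "us-east-1a"}},
    {"InstanceId": "i-0bbb", "Placement": {"AvailabilityZone": "us-east-1a"}},
    {"InstanceId": "i-0ccc", "Placement": {"AvailabilityZone": "us-east-1b"}},
]

print(dict(instances_per_zone(fleet)))
print("Single point of failure:", lacks_zone_redundancy(fleet))
```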

Performance Efficiency

The Performance Efficiency pillar focuses on the efficient use of computing resources to meet requirements, and maintaining that efficiency as demand changes and technologies evolve.

AWS Performance Efficiency Pillar White Paper, November 2016

The AWS platform changes extremely frequently, and at an ever-increasing rate (see graph). This means the services you know and love are constantly improving. An infrastructure created in 2011 will not be as efficient as one made in 2017, even if you were following best practices at the time. This pillar is meant to help you deliver a consistently speedy experience to your users, by upgrading your resources to the most up-to-date versions released by AWS and by following performance best practices.

Graphic borrowed from: @acloudguru AWS Developer Certification Course

Example best practices:

Elastic Compute Cloud (EC2) — Do upgrade EC2 instances that are overutilized. Don’t let average CPU utilization stay above 90%.

Elastic Compute Cloud (EC2) — Do ensure that all of your servers are running on current-generation EC2 instance types. Don’t run servers on previous-generation instance types.
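As a rough illustration of the 90% rule, the sketch below flags instances whose average CPU utilization is above a threshold. It assumes you have already collected CPU samples per instance (from CloudWatch, for example) into a plain dictionary. The function name, the data layout, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean

CPU_THRESHOLD_PERCENT = 90.0  # the ceiling named in the practice above


def overutilized(cpu_samples_by_instance, threshold=CPU_THRESHOLD_PERCENT):
    """Return the sorted IDs of instances whose average CPU utilization exceeds the threshold."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_samples_by_instance.items()
        if samples and mean(samples) > threshold
    )


# Hypothetical CPU percentages for illustration; the instance IDs are invented.
samples = {
    "i-0aaa": [95.0, 92.0, 97.0],
    "i-0bbb": [40.0, 55.0, 61.0],
    "i-0ccc": [],  # no data collected yet, so it is skipped
}

print("Candidates for a larger instance type:", overutilized(samples))
```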

Cost Optimization

The Cost Optimization pillar is used to assess your ability to avoid or eliminate unneeded costs or suboptimal resources, and use those savings on differentiated benefits for your business.

AWS Cost Optimization Pillar White Paper, November 2016

We consistently see AWS infrastructures running unused or underused resources. This can lead to wasted spending of 50% or more. You’d be surprised by how common it is for AWS developers to spin up a resource for testing purposes, and then forget about it for months while it accumulates hundreds of dollars in useless spend. By following the principles and practices of this pillar, you’ll reduce your monthly spend and clean up unnecessary resources found within your infrastructure.

Example best practices:

Elastic Block Store (EBS) — Do ensure you’re keeping only the EBS volumes that you’re actually using. Don’t let idle EBS volumes balloon your costs unnecessarily.

Elastic Compute Cloud (EC2) — Do downsize underutilized EC2 instances to match your actual usage. Don’t needlessly pay for compute capacity that you’re not using.
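Forgotten volumes are one of the easier things to hunt down. The sketch below assumes a list of volume dictionaries in the shape the EC2 DescribeVolumes API returns, where an unattached volume has `State` set to `available` and an empty `Attachments` list, and `Size` is given in GiB. The helper names and sample volumes are hypothetical.

```python
def unattached_volumes(volumes):
    """Return the volumes that are not attached to any instance."""
    return [
        vol
        for vol in volumes
        if vol.get("State") == "available" and not vol.get("Attachments")
    ]


def idle_gib(volumes):
    """Total size, in GiB, of the unattached volumes."""
    return sum(vol.get("Size", 0) for vol in unattached_volumes(volumes))


# Hypothetical volumes for illustration; the IDs are invented.
volumes = [
    {"VolumeId": "vol-0aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-0aaa"}]},
    {"VolumeId": "vol-0bbb", "State": "available", "Size": 500, "Attachments": []},
]

for vol in unattached_volumes(volumes):
    print(f"{vol['VolumeId']} is idle ({vol['Size']} GiB)")
print("Total idle storage:", idle_gib(volumes), "GiB")
```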

Operational Excellence

The Operational Excellence pillar includes operational practices and procedures used to manage production workloads. This includes how planned changes are executed, as well as responses to unexpected operational events. Change execution and responses should be automated. All processes and procedures of operational excellence should be documented, tested, and regularly reviewed.

AWS Well-Architected Framework White Paper, November 2016

Large enterprises aren’t managed like startups, and large enterprise infrastructures shouldn’t be managed like startup infrastructure. Processes should be documented and rigorously applied wherever necessary. Resource provisioning should be automated, developer activity should be monitored and logged for auditing purposes, and infrastructure design should be dummy-proof. If you’re relying on staff to manually manage your deployments to production environments, you’re inevitably going to experience issues.

Example best practices:

AWS Certificate Manager (ACM) — Do delete expired ACM certificates. Don’t risk deploying expired certificates to resources that will complain and cause your app to crash.

CloudFormation — Do use CloudFormation templates to automatically provision and manage your infrastructure. Don’t waste time manually configuring your infrastructure in the AWS console.
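For a sense of what infrastructure as code looks like, here is a minimal sketch that builds a tiny CloudFormation template as a Python dictionary and serializes it to JSON. The logical ID `ExampleBucket` and the description are placeholders, and the single resource (an S3 bucket with public access blocked) is only an example, not a recommended baseline.

```python
import json

# A very small CloudFormation template, expressed as a Python dictionary.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Example only: a private S3 bucket defined as code.",
    "Resources": {
        "ExampleBucket": {  # hypothetical logical ID
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "PublicAccessBlockConfiguration": {
                    "BlockPublicAcls": True,
                    "BlockPublicPolicy": True,
                    "IgnorePublicAcls": True,
                    "RestrictPublicBuckets": True,
                }
            },
        }
    },
}

# Serialize to JSON so the template can be saved to a file and version controlled.
template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the template is just text, it can live in the same repository as your application and go through the same review process as any other change.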