An Infrastructure Guide for Founders

How to avoid security debt with early AWS design patterns.

You are a founding team of engineers who have decided to move beyond the proof of concept phase for your product. Or maybe you’re planning to leave your current PaaS (as Remind and Coinbase did). Either way, you are designing your AWS infrastructure from the ground up.

Let’s include security in those discussions.

Your answers to these questions should contribute to other development value systems, like continuous deployment, shared responsibility, and developer velocity.

Have the following discussion points with your team.

What are we going to do with logs?

If you don’t have a coherent strategy around logging, you will be mostly useless during a security incident. Additionally, you’ll have no data to inform policy decisions around actual usage of your infrastructure.

More importantly, logs assist with product availability even more often than they assist with security risks, which is why centralized logging becomes such a high leverage investment.

Security roles are often the ones screaming for log infrastructure, but performance troubleshooting can sometimes end up being the primary use afterwards. No one wants to waste time troubleshooting with logs in disparate places.

To decompose the problem: you are going to have a whole constellation of systems producing wildly differing logs.

By default, systems will log locally, or not at all. Logging provides no leveraged resource for troubleshooting or investigation unless you plan to corral your logs into a centralized and queryable system.

A conversation with a team should consider these buckets:

Application: Are we writing code that decorates logs with useful information?

System: Are the instances we are running writing useful things to /var/log, and should we be capturing them?

Infrastructure: Are we storing the logs our cloud infrastructure produces (CloudTrail)? Do we want to capture network flow data as well?

A good default location for these would be CloudWatch Logs. You can dump logs at scale into it with pretty reasonable associated costs. Then the team has a single place to query logs and build rudimentary dashboards.
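As a sketch of what shipping application logs there can look like, assuming boto3 is available and a log group and stream already exist (the group, stream, and field names below are made up):

```python
import json
import time

def make_log_event(message, **fields):
    """Build a CloudWatch Logs event: epoch-millisecond timestamp, plus a
    JSON message body so the extra fields stay queryable later."""
    return {
        "timestamp": int(time.time() * 1000),
        "message": json.dumps({"message": message, **fields}),
    }

def ship_events(log_group, log_stream, events):
    """Push a batch of events to CloudWatch Logs. Assumes the group and
    stream already exist and AWS credentials are in the environment."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("logs")
    client.put_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        logEvents=sorted(events, key=lambda e: e["timestamp"]),
    )

# ship_events("/myapp/production", "web-1",
#             [make_log_event("login", user_id=42, source_ip="203.0.113.9")])
```

Decorating each event with structured fields up front is what makes the centralized store queryable later, instead of a pile of free text.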

Will we eventually need a multiple account strategy?

Some larger companies eventually split into multiple accounts. This may happen at the request of a large customer. Or, maybe you plan on spinning out a separate product that requires its own tenancy. Even young startups can gather tens of accounts, with larger companies having far more.

Adopting a non-monolithic account strategy early can help you avoid multiple downstream refactors of your infrastructure. Deciding this early would:

Help create a well defined blast radius.

Give you an early decision on where you manage your centralized logs (a logs account) or identity (a singular IAM source of identity).

Encourage the use of “Roles”, which greatly reduces the impact of leaked IAM keys.

Enable features like AWS Organizations which can later enforce mandatory policies on accounts, and centralize billing.
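To make the role-based, multi-account pattern concrete, here is a minimal sketch with boto3 and STS: an engineer’s identity-account credentials are traded for short-lived keys in a target account (the account id and role name are hypothetical):

```python
def role_arn(account_id, role_name):
    """Build the ARN for a role in a target account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"

def assume_into(account_id, role_name, session_name="engineer"):
    """Trade identity-account credentials for short-lived keys in the
    target account. Nothing long-lived exists to leak."""
    import boto3  # lazy import: only needed when actually calling AWS
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName=session_name,
        DurationSeconds=3600,  # credentials expire on their own
    )
    creds = resp["Credentials"]
    return boto3.session.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# session = assume_into("123456789012", "ReadOnly")
```

This is the workflow the “bash gymnastics” mentioned below usually wrap: every cross-account action flows through an assumed role with an expiry, rather than static keys per account.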

The biggest argument for monolithic approaches to AWS accounts (and code) has been developer speed and simplicity. It’s important to consider how valuable that early velocity will be, and how drastic the incoming developer complexity will be when you achieve growth. If you can head this off by designing for a future “mid game”, you can avoid a lot of churn.

Adjusting to AWS roles and the bash gymnastics to make developer workflows succeed can take some time to hammer out. In most cases, a multiple account strategy is inevitable, so there’s rationale in confronting this problem early rather than later. Fortunately, lots of companies face this problem, and lots of open source projects exist to ease this pain.

How will we directly troubleshoot production systems?

You have a production outage. An engineer wants to ssh to a production host, sudo to root, and tcpdump traffic as a part of troubleshooting.

At some point an engineer will want to directly interact with a production server to troubleshoot an issue. Fairly invasive access may be needed to get to the root of a problem. While this access should be available in a crisis, it should be thoughtfully designed, like keeping your rifles under lock and key.

In the future, administrative production access should be considered rare and special. It’s possible to limit production access pretty drastically.

Centralized logging or other performance monitoring tools (New Relic, Datadog) can drastically reduce the overall need to directly invade production.

You can enforce a “bastion host” network and authentication model for administrative access. This creates an intentional and high security route for administrative access.

Access to production can be temporarily granted for one off situations, and otherwise be totally unavailable. This makes a stolen credential, or a former employee, a less terrifying possibility.
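One minimal sketch of a temporary grant, assuming boto3 and a bastion security group: SSH is opened for a single engineer’s IP, and a scheduled reaper revokes the rule once its window expires (the group id and TTL are illustrative):

```python
import datetime

def grant_is_expired(granted_at, ttl_minutes, now=None):
    """Decide whether a temporary access grant has outlived its window."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return now >= granted_at + datetime.timedelta(minutes=ttl_minutes)

def open_temporary_ssh(security_group_id, engineer_ip):
    """Open SSH from one engineer's IP. A scheduled reaper job would call
    revoke_security_group_ingress once grant_is_expired() is true."""
    import boto3  # lazy import: only needed when actually calling AWS
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpProtocol="tcp",
        FromPort=22,
        ToPort=22,
        CidrIp=f"{engineer_ip}/32",  # one host, never 0.0.0.0/0
    )

# open_temporary_ssh("sg-0abc1234", "203.0.113.9")
```

The important property is the default: no standing rule exists, so a credential stolen outside of a grant window has nothing to connect to.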

Specifically considering the risks from production troubleshooting early on will help limit the overall damage caused by multiple types of incidents.

Though it’s probably more important to mention that this will also help limit manually invoked “drift”. This is a known source of outages as well, and we’ll cover it more next, with another design pattern to discuss.

Will infrastructure share our normal engineering standards?

Some engineering teams will enforce a very regimented and disciplined culture around their product, but then fumble around and point-and-click their infrastructure together in the console of their cloud of choice.

It doesn’t need to be this way anymore, with the advent of “configuration as code” products like Terraform and CloudFormation. These allow infrastructure to be part of your deploy pipeline just like your application build. You write code that represents your infrastructure, and builds your infrastructure like you’d build your app.

Artificially limiting access to your infrastructure’s consoles will help promote engineering standards to your infrastructure and limit drift. An inability to casually drift is a strong litmus test that helps prove how resistant an infrastructure is to malicious changes, or general atrophy from changes your team forgot to revert.

Once your infrastructure essentially lives in a repository, you can require reviews, add tests, and apply the other code quality practices you already enforce in product.

You can limit the ability to modify infrastructure to a build server or a CI/CD pipeline similar to what you may already be familiar with, and have better certainty about the immutability of servers from erroneous or malicious change.
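As one sketch of that pipeline step, using CloudFormation via boto3 (the stack and resource names are illustrative, and a real pipeline would handle the create-versus-update case and change sets):

```python
import json

def render_template(resources):
    """Render a minimal CloudFormation template body from a resource map."""
    return json.dumps({
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": resources,
    })

def deploy_from_ci(stack_name, template_body):
    """Apply an infrastructure change from the build server, so that no
    human needs write access in the console."""
    import boto3  # lazy import: only needed when actually calling AWS
    cfn = boto3.client("cloudformation")
    cfn.validate_template(TemplateBody=template_body)
    cfn.update_stack(  # create_stack on the very first run
        StackName=stack_name,
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )

# deploy_from_ci("prod-network",
#                render_template({"LogBucket": {"Type": "AWS::S3::Bucket"}}))
```

Because only the build role can call update_stack, any change outside of a reviewed commit simply has no path into production.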

How will our network be segmented?

Conversations about network layouts are pretty standard during infrastructure planning, if only to avoid improper peering relationships or to allocate the specific number of subnets you’ll need.

Make sure the early team understands high level networking concepts in your cloud environment and plans for an approach that only exposes services to the internet intentionally.

For instance, you might want to expose a load balancer, but maybe not your internal-only Jenkins. Will your planning take public and private segmentation into account?

This has significant reliance on proper engineering standards, as discussed earlier. We don’t want casual changes to occur, and networking is a sensitive area of risk.

Once a network layout is decided on, you’ll want to avoid any drift that exposes sensitive servers, like a caching server or database. These often become exposed during temporary troubleshooting endeavors, or when someone lazily allows any protocol with a 0.0.0.0/0 rule to a security group.

Having a disciplined approach to security groups and network ACLs can save you when a new network based attack runs wild.
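That kind of drift can be caught with a small periodic audit. A sketch with boto3, where the interesting logic just walks the structure that EC2’s describe_security_groups returns:

```python
def world_open_rules(security_groups):
    """Flag ingress rules open to the entire internet, given the list of
    dicts returned by EC2 describe_security_groups."""
    findings = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            for ip_range in perm.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    findings.append((sg["GroupId"], perm.get("FromPort")))
    return findings

def audit_account():
    """Pull every security group in the account and report open rules."""
    import boto3  # lazy import: only needed when actually calling AWS
    ec2 = boto3.client("ec2")
    return world_open_rules(ec2.describe_security_groups()["SecurityGroups"])
```

Run on a schedule, a report like this catches the “temporary” troubleshooting rule that never got reverted, before an attacker does.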

Where are we going to store our secrets?

A common anti-pattern for early teams is to store credentials and API keys in source code, environment variables, or the copy paste buffer of an engineer’s laptop. Leaked secrets are the primary root cause of security incidents I’ve worked on.

The joke is always different, but the punchline is the same: an API secret ends up on Pastebin or GitHub.

A lot has been written on this subject.

Suggesting a total refactor of how a company stores and consumes secrets can be one of the hardest things to recommend to a more mature environment. It can require reworks of custom applications maintained by engineers or overhauls of how systems are built in general. Approaching this problem very early can help drive standards and eliminate a future area of debt altogether.

The highest priority issue is keeping secrets out of repositories, Slack channels, and copy paste buffers, by any means necessary.

Second to this is being able to rotate them seamlessly and quickly. This is a much harder goal.
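One common approach is to fetch credentials at runtime from a managed store like AWS Secrets Manager, so they never live in the repository at all. A minimal sketch with boto3 (the secret name is hypothetical), plus a helper for keeping them out of logs:

```python
def redact(secret, keep=4):
    """Mask a secret for safe logging: show only the last few characters."""
    return "*" * max(len(secret) - keep, 0) + secret[-keep:]

def fetch_secret(secret_id):
    """Pull a credential at runtime instead of baking it into source,
    environment variables, or a paste buffer."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]

# db_password = fetch_secret("prod/db/password")
# log.info("loaded db credential %s", redact(db_password))
```

Because applications read the secret by name on every start, rotating it becomes a store-side operation followed by a redeploy, rather than a hunt through code and config.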

Conclusion

Security incidents involving cloud infrastructure have fairly predictable root causes. These root causes are generally exposed because of multiple areas of long standing security debt. We’ve discussed several early investments that help avoid these debts as infrastructure eventually becomes complex.

But, it’s important to note that security goals often feel like distractions if they don’t also serve other priorities while building an early company. I’ve done my best to select security patterns that also hit those other areas and avoid security patterns without shared value. Those would come later.