The cloud, unlike the Force, is not a mystical energy that surrounds us and binds us.

Yet the word “cloud” has been credited with sorcerous properties that antiquate your IT infrastructure and empower you to grow endlessly according to your needs without ever lifting a finger. Cloudelius! Everything is done automagically. While theoretically true, in practice setting up your cloud for complete resilience and autonomous operation is slightly more complex.

Let’s go over some of the issues companies and developers usually face when growing their infrastructure and how to go about solving them.

Setting the Scene

When you start laying out your cloud infrastructure, the first thing you start messing around with is the Console (the administration panel you get upon logging into your cloud provider’s web interface). The admin panel allows for easy creation of resources such as VMs, storage buckets, messaging queues, etc., in an (arguably) intuitive point-and-click UI.

Kickoff day has arrived and now comes that milestone moment when you bring up your first server using the Console. Let’s imagine that you start out with a monolithic app running on, say, a LAMP or MEAN stack. You fire up a machine with your favorite Linux distribution, SSH into the machine with the private key the Console generated for you and start installing the necessary software for your application.

As your business grows you’ll need to separate your architecture into more and more servers and services to handle more load and to minimize single points of failure (SPOFs). Want your database on a separate server? No problem! Point, click, and bam, you have a new powerful server at your fingertips. Again you SSH into the machine, install the necessary software and it’s up and running.

So now you have a handful of servers in use, and while they may share some configuration and software, they differ significantly in many aspects. But all’s OK. It’s only a handful of servers after all; you know each one by name and exactly what’s running on it. It’s under control. Surprise (enter product team)! A new feature requires a new dependency. You start SSH-ing into the machines one by one, installing the necessary software before deploying the new code. Deploy the code, and... error. You forgot one server (Docker can help with these types of issues, but you might still have unique configurations in your Docker hosts which you’ll need to maintain).

By now a few months have gone by and you’ve reached the point where you need a separate testing environment. However, you’ve made extensive manual changes to your servers through SSH and the Console so reproducing your production environment perfectly will be a formidable challenge. Oh, BTW, if you later decide that you need to set up a staging environment as well, you’ll need to do this all over again.

So now for crunch time. There is a huge lead in the pipeline; you’ve already pushed them successfully through demos, technical calls and a trial test. Basically, all you have left is some paperwork that needs to be signed. Then someone on their compliance/legal team asks who has access to the servers, and you have to admit ALL the developers do. Everyone has the SSH keys because everyone checks the logs and maintains the servers. Your hot lead bows out, claiming that their information is too classified/important. Depending on where your company is development-wise, this scenario might seem far-fetched, or it might hit painfully close to home.

In this post I’ll address some of the challenges that come hand in hand with “The Cloud” and how to tackle them. There isn’t one solution that fits all, but as we’ve been examining and testing the past, present and future implementations of our infrastructure, we’ve gathered some tips and insights that we can share.

Configuration Management

Obviously, you’re not the first to run multiple servers and have these issues. Smarter people than I have already solved this problem many times over. What used to be accomplished with custom-developed shell scripts is accomplished today with more advanced tools like Chef, Puppet, Ansible and more. Each differs slightly in features and approach, but the core philosophy is the same: maintain a single repository/source of truth for configurations and have a daemon run on the server fetching the configurations and applying them. It’s important to note that configuration doesn’t just mean .conf files, .ini files or dot files, but also includes the software installed on the server. Beyond the core philosophy there are a few additional points that make these tools truly invaluable.

Extendable configurations: allows reuse of a base configuration that is applicable to all your servers with specific configurations extended to each of your server groups.

Versioning: you can, and should, save the configuration files in a VCS like Git. This enables you to track changes and add a peer-review process for them, with the ability to roll back easily.

Visibility: as long as you know the “group” (each tool has its own term for this) of the server you know exactly what’s installed on that server.

Reproducibility: you can spin up 10, 50, 1000 more servers and they’ll be identical. No more SSHing into each one to install and configure it! You can also bring up a local VM with an identical configuration to the one in production.
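To make the “extendable configurations” point a bit more concrete, here is a minimal Python sketch of the idea, not how Chef, Puppet or Ansible actually implement it. The group names, packages and settings are all made up for illustration, and the merge rule (packages accumulate, settings override) is an assumption; each real tool has its own precedence rules.

```python
from copy import deepcopy

# Hypothetical base configuration shared by every server.
BASE = {
    "packages": ["ntp", "fail2ban"],
    "settings": {"timezone": "UTC", "log_level": "info"},
}

# Hypothetical per-group extensions; names and values are illustrative only.
GROUPS = {
    "web": {"packages": ["nginx"], "settings": {"log_level": "warn"}},
    "db": {"packages": ["postgresql"], "settings": {}},
}


def resolve(group: str) -> dict:
    """Return the effective configuration for a server group."""
    config = deepcopy(BASE)  # never mutate the shared base
    extension = GROUPS[group]
    # Packages accumulate: group packages are appended to the base list.
    for pkg in extension.get("packages", []):
        if pkg not in config["packages"]:
            config["packages"].append(pkg)
    # Settings override: group values win over base values.
    config["settings"].update(extension.get("settings", {}))
    return config


if __name__ == "__main__":
    for name in GROUPS:
        print(name, resolve(name))
```

Because the effective configuration is derived purely from the group name, knowing a server’s group tells you what is installed on it, and every new server in that group resolves to the same result.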

On a side note, since you have a single source for passing the configurations down to your clusters, all a server has to do when it’s provisioned is install the daemon and contact the master server (this may differ slightly between tools). This means your start-up (a.k.a. user-data) script can be minimal, lightweight and easier to maintain.

Auto-Scaling Groups and Metrics

Now that your configuration management tool is set up and running to your liking, it’s time to start taking advantage of what “The Cloud” offers. Before we dive in let me get one point out of the way. Throwing additional servers at a performance or scaling problem might work in some cases, but won’t work in all. The majority of the time it will depend on the type of software we’re trying to scale. I’m not going to go into scaling software here as it deserves its own post (several posts even), so I’m assuming that your software allows for horizontal scalability.

AutoScaling is a pretty intuitive term. It simply means “I want the cloud to scale for me”. No more manually bringing up additional instances when you’re featured on Hacker News. Some people think that just by moving your application from your dedicated servers to the cloud, the job is done and the magic of the cloud will take care of the rest. Unfortunately, that’s not the case. For the cloud to automatically scale your infrastructure you have to tell it how and when:

How: By identifying the “groups” in Auto-Scaling Groups (abbreviated ASG). You must first sort your servers into logical groups which act together for a single purpose and would benefit from horizontal scaling. Some common examples are web servers, database replicas (to a point), a caching cluster, etc. Usually you can easily identify these groups because they share an identical configuration in your configuration management tool.

When: Once you’ve identified your groups, you’ll need to determine when you’ll need more power, and when you’ll need less. By doing so you have identified your scale-up and scale-down events respectively. It’s not always easy to determine what should trigger these events. For example, for web servers the trigger could be CPU load or even response time, while for a caching cluster you might want to look at the hit ratio or cache evictions and spin up more machines if the ratio/count moves outside certain limits.
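The “when” boils down to comparing a metric against an upper and a lower limit. Here is a rough Python sketch of that decision, assuming a simple one-instance-at-a-time policy. The class, the function and all the threshold values are hypothetical; real cloud providers offer richer policies (cooldowns, step scaling, target tracking) that this deliberately ignores.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All thresholds and limits are illustrative assumptions.
    scale_up_above: float    # metric value above which we add a server
    scale_down_below: float  # metric value below which we remove a server
    min_size: int            # never shrink the group below this
    max_size: int            # never grow the group beyond this


def desired_size(current: int, metric: float, policy: ScalingPolicy) -> int:
    """Return the group size after applying one scaling decision."""
    if metric > policy.scale_up_above:
        target = current + 1
    elif metric < policy.scale_down_below:
        target = current - 1
    else:
        target = current  # metric is inside the comfortable band
    # Clamp to the configured bounds of the group.
    return max(policy.min_size, min(policy.max_size, target))


if __name__ == "__main__":
    # Hypothetical web-server group scaled on average CPU percentage.
    web_policy = ScalingPolicy(scale_up_above=70.0, scale_down_below=25.0,
                               min_size=2, max_size=10)
    for cpu in (85.0, 50.0, 10.0):
        print(cpu, desired_size(3, cpu, web_policy))
```

The gap between the two thresholds matters: if they sit too close together, the group can flap between adding and removing servers.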

As you can see, the “when” of ASG is intricately connected to system metrics. You’ll need to send your metrics to the monitoring solution offered by your cloud provider (e.g., CloudWatch for Amazon Web Services), otherwise “The Cloud” won’t know when to scale your servers. If you don’t want to send metrics to your cloud provider, you could send them to your own monitoring solution and write your own scripts to bring servers up/down using the cloud’s API, although that’s a lot more work and in my opinion you should focus your energy on your core business.

Identifying what metric to use and what limits to set is rarely straightforward and might require trial and error before you get it right. What works for one server group might not work for another as its needs and usage pattern are completely different. Another great benefit of ASG is cost reduction. When your load is lighter you can take advantage of ASG to remove resources not utilized and save on costs.

Logging

Now your servers are running in the cloud and they’re configured automatically using one of the above technologies, but you still need to SSH into the machines to read logs and debug issues. In the beginning everyone does this as it’s easy and straightforward, but it doesn’t scale well. When you have an AutoScaling Group of servers you might not even know which server served which request or which one holds the log file you’re looking for.

Centralized logging is a must when you architect your cloud solution. You’ll need to think about how logs are generated, how fast they’re generated and how they’ll be used. A popular solution involves setting up an ELK stack: you can roll your own or use a managed service. Alternatively, some cloud providers offer their own solutions such as CloudWatch Logs, and of course there’s always the good old syslogd.

A major benefit of centralized logging is the elimination of log file buildup that eats up disk space on your servers. I’m sure you’ve experienced a midnight on-call alert of “Not enough disk space”. When you log in you discover huge log files that haven’t been set up to rotate properly. Centralized logging also solves some of the issues you’ll face with ASG. When servers are removed due to low load they also lose their entire hard disk and state. If those servers had any logs on their local filesystem you’ll lose those forever, unless you’re writing to a centralized location.
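One practical detail: once logs from many servers land in one place, each entry needs to say which server it came from. Below is a small Python sketch using only the standard library that emits one JSON line per record, tagged with the host name. The logger name and field names are my own assumptions, not a required schema. It writes to a stream to stay self-contained; in a real setup you would swap the handler for whatever ships logs to your central store (for example `logging.handlers.SysLogHandler`, or a shipping agent reading stdout).

```python
import json
import logging
import socket
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line tagged with the host name."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "host": socket.gethostname(),  # which server produced the entry
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing it again
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")  # hypothetical service name
    log.info("order %s processed", 1234)
```

Structured lines like these are easy to search by host or service later, without knowing in advance which instance in the group handled the request.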

Architecture Management

Similar to configuration management, there are also tools to configure, document and version your architecture as a whole. There are vendor-specific tools like CloudFormation, which only works on Amazon’s cloud, or Heat, which works for OpenStack, and there are cloud-agnostic solutions that support many cloud providers (and even SaaS providers), like Terraform by HashiCorp. These tools allow you to describe your entire architecture in one or more configuration files; and when I say “entire” I mean everything from security roles, networking and servers to DNS and more. Those files can then be committed into a VCS, just like the configuration management files, for peer review and backup.

Now let’s say you want to set up a new staging environment identical in every way to your production environment. You can easily do this by running the configuration you laid out with your architecture tool. Want to adjust your architecture? No more clicking and changing things in the Console, simply change the configuration, run the tool and let it do the rest. Your architecture is now fully reproducible. Make the changes you need and replay them on your other accounts.
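The core idea behind these tools is declarative: you describe the state you want, and the tool works out what to change. Here is a toy Python sketch of that comparison step. It is not how Terraform, CloudFormation or Heat work internally, and the resource names and fields are invented for illustration.

```python
def plan(desired: dict, actual: dict) -> list:
    """Compare desired and actual resources and list the actions needed."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical description of the architecture we want.
desired = {
    "web-server": {"type": "vm", "size": "medium", "count": 3},
    "db-server": {"type": "vm", "size": "large", "count": 1},
    "assets": {"type": "bucket"},
}

# Hypothetical state of what currently exists in the account.
actual = {
    "web-server": {"type": "vm", "size": "small", "count": 3},
    "old-queue": {"type": "queue"},
}

if __name__ == "__main__":
    for action, name in plan(desired, actual):
        print(action, name)
```

Running the same description against an empty account yields nothing but “create” actions, which is exactly the new-staging-environment case: same file, different target.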

Locking it Down

As a rule of thumb, if you have to SSH into an instance to fix something, there is a deeper issue that needs to be addressed in order to make your cloud setup more resilient. A good cloud setup is one that doesn’t need human interference on specific machines. Furthermore, you should always consider your servers as groups; any fix performed on one should be applied to all instances. To achieve this, the fix needs to be made through either a configuration management tool or an architecture management tool, or by relieving some other pain point (like logging, monitoring, metric gathering, automatic deployments, etc.).

If everything is automated correctly, you generally shouldn’t have to SSH into a server. Once you reach this state the next big step will be to shut down SSH access completely. Not all companies go this route, regardless of size. Netflix for example is famous for giving all its engineers SSH access (check out this cool talk).

We believe it’s best to lock down everything as tightly as possible, especially when you’re dealing with customer data. Moreover, this tests your automation to the fullest extent: any issues you haven’t fully automated and taken care of will be crystal clear once you can’t use SSH anymore. And once those issues surface, you’ll know what to focus on.

Cloud in Progress

I realize the setup I described in this post can be a little overwhelming, especially for startups just starting out, and that this “utopian” configuration is a lot of work to get up and running, even more so when you have a small team. But like everything in tech, small incremental changes can be made over time. There’s absolutely no need to have all this set up and functioning straight out of the gate. It’s a lot of technologies to learn, implement and get used to. So take your time and make sure the adoption of these changes is reflected not only in your day-to-day work processes but also in your company culture.

An approach I found to work for me is to keep a log of manual actions. I take note of the service I’m working on, the server I had to connect to and the nature of the fix (the reason I connected). Then once in a while I’ll go over that log, group the entries by root cause and try to weed out the bulk, choosing what to fix next based on its impact on performance and productivity.
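If you want to keep that log as data rather than in a notebook, a few lines of Python are enough. This is only a sketch under my own assumptions: the field names and sample entries are hypothetical, and it ranks root causes by how often they occur, which is a crude stand-in for the impact-based judgment described above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ManualAction:
    service: str     # the service being worked on
    server: str      # the server that was connected to
    reason: str      # why the connection was needed
    root_cause: str  # label assigned later, when reviewing the log


def rank_root_causes(log: list) -> list:
    """Return (root_cause, occurrences) pairs, most frequent first."""
    return Counter(entry.root_cause for entry in log).most_common()


# Hypothetical entries, purely for illustration.
actions = [
    ManualAction("api", "web-1", "read error log", "no centralized logging"),
    ManualAction("api", "web-2", "read error log", "no centralized logging"),
    ManualAction("worker", "job-1", "read error log", "no centralized logging"),
    ManualAction("worker", "job-1", "install library", "unmanaged dependency"),
    ManualAction("api", "web-2", "install library", "unmanaged dependency"),
    ManualAction("api", "web-1", "clear full disk", "no log rotation"),
]

if __name__ == "__main__":
    for cause, count in rank_root_causes(actions):
        print(count, cause)
```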

Let me know your tips for automating the cloud and your architecture; I’m always looking for new tricks and tips.