By Vaidik Kapoor

About a year ago, we started setting up our infrastructure platform to make daily system operations and infrastructure maintenance easy and enable our developers in a way that they can move faster. This post is part one of a series on how we use Ansible at Grofers to manage our infrastructure.

Background

At Grofers, Ansible is the primary language for infrastructure — be it configuration management, automation, deployment or ad-hoc tasks, we use Ansible for just about everything.

We strongly believe that the end-to-end infrastructure for running a service in production should be managed by the developers themselves. It turns individual teams into small, self-contained units that don’t need sysadmins. Developers are better equipped to debug production issues because they have the best understanding of their own systems. This helps resolve issues quickly and gives everyone autonomy with responsibility, so experiments can be carried out freely.

But to achieve all of this, we had to choose one tool that could solve these problems in the best possible way for our team. We chose Ansible over other alternatives (like Chef, Puppet, Salt, etc.) for a few reasons:

- Extremely simple syntax — Ansible playbooks are written in plain YAML. It’s not a new programming language, and tasks run sequentially. It’s easy enough to understand that anyone can read an Ansible playbook and tell what it does. Puppet, on the other hand, has its own DSL (a complete language in its own right) and its execution model is not sequential: Puppet builds a Directed Acyclic Graph of all the resources to be provisioned and decides the execution order itself to avoid any ambiguity. While this is extremely powerful and leaves less room for mistakes, it is not easy for most developers to wrap their heads around. Chef recipes are written in Ruby, and in a company full of Python developers it just doesn’t make much sense to have everyone learn Ruby. We wanted an easier learning curve for people adopting the new infrastructure tooling.
- No packages required on the remote host — Ansible runs completely over SSH. To execute commands on remote hosts, all you need is a working SSH server. Ansible leaves nothing extra behind after it has completed its execution, while most other configuration management tools require you to install an agent or the tooling itself on the host.
- Support for Docker — Ansible has supported working with Docker containers for a very long time. In fact, Ansible’s docker modules work with containers in much the same way as docker-compose. While we were not using Docker when we started, we knew we could not stay away from the new developments in the container world, and we wanted our choice of tooling to support working with Docker containers as well.
- Completely written in Python — while we boast of being a polyglot company, Python is still our go-to programming language unless there is a strong reason not to use it. Most of our engineers are comfortable with the language, so it naturally made sense to choose Ansible: anyone can extend it by writing plugins and modules in Python.
- Deep AWS integration out of the box — our infrastructure runs on AWS, and Ansible integrates deeply with it. You can manage almost all common AWS services using Ansible, which has a lot of value in going all the way with automation. For example, it is possible to write playbooks that launch completely working Elasticsearch clusters, from launching the instances to installing Elasticsearch and configuring each server to join the cluster, all without touching the AWS web interface or any other manual intervention.
- Support for dynamic infrastructure — the entire point of using a cloud provider like AWS is that resources come and go depending on requirements, and it is important that your tooling adjusts to this. Ansible’s Dynamic Inventory feature makes it possible to work with a dynamic infrastructure using cloud APIs. It is a powerful feature that comes in handy quickly once you are handling hundreds of instances: instead of dealing with IP addresses, you start thinking in terms of conventions, and you can do really powerful things like building hybrid infrastructures that span multiple geographies or even multiple providers.
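To give a feel for the syntax point above, here is a minimal, hypothetical playbook (not one of ours; the hosts group and package are illustrative):

```yaml
# A minimal example playbook: install and start nginx on all web hosts.
- hosts: webservers
  become: yes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present

    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
```

Even someone who has never seen Ansible can read this top to bottom and tell what it does — which is exactly the property that made adoption easy for us.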

Ansible had all the right qualities to make it easy for our team to see value in it and start using it for all sorts of configuration management and automation. We then started exploring ways to integrate it deeply into all our engineering teams, so that engineers would use Ansible on a daily basis, it would become an integral part of their work, and they would start depending on it. Slowly we built enough tools and solutions with Ansible that made our developers’ lives easier, and they found value in learning it.

Use-Cases

In this post, we will briefly touch upon the major use-cases of Ansible at Grofers, and how we have structured it to suit our needs and our infrastructure conventions.

Launching New EC2 Instances

One of the most important things for any startup is to be able to move fast. Conventionally, companies have a central sysadmin team that sets up machines when developers request them. This is usually a blocker in day-to-day work and slows engineers down when they cannot quickly launch infrastructure for testing or for going live.

With the cloud, this became easier, as developers get a software abstraction for managing their infrastructure. Companies usually build internal dashboards on top of cloud APIs to smooth the process of managing infrastructure. We do this with Ansible: instead of dashboards, we use Ansible as our toolchain for managing infrastructure resources.

Developers can launch new instances using our custom Ansible playbooks. We leverage prompts to let users fill in just the required set of inputs (like instance type, storage size, availability zone, cluster name, etc.). Everything in our infrastructure is driven by conventions. There are a few things that we don’t allow changing or choosing (for example, the Linux distribution, as that’s hardly ever a genuine requirement and fixing it makes the infrastructure much easier to maintain). The playbook makes sure that all the conventions are met (like naming conventions, choosing the right subnet, etc.), or the user gets a relevant error. Using this playbook, one can launch a single instance or a cluster of hundreds of instances — that’s entirely up to the developer.
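A sketch of what such a playbook might look like, using Ansible’s vars_prompt together with the ec2 module (the AMI ID, key name, subnet, and tag conventions here are placeholders, not our actual setup):

```yaml
# Hypothetical instance-launch playbook driven by prompts.
- hosts: localhost
  connection: local
  gather_facts: no
  vars_prompt:
    - name: instance_type
      prompt: "EC2 instance type"
      private: no
    - name: cluster_name
      prompt: "Cluster name"
      private: no
  tasks:
    - name: Launch an instance following our conventions
      ec2:
        instance_type: "{{ instance_type }}"
        image: ami-xxxxxxxx            # baked base AMI (placeholder ID)
        key_name: deploy-key           # placeholder key pair
        vpc_subnet_id: subnet-xxxxxxxx # chosen per conventions (placeholder)
        instance_tags:
          Name: "{{ cluster_name }}-01"
          Cluster: "{{ cluster_name }}"
        wait: yes
```

Validation of naming conventions and subnet selection would happen in tasks before the launch step, failing early with a relevant error if an input doesn’t conform.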

Similarly, other management tasks like adding more storage or swap space are also done using Ansible playbooks.
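As an example, tasks for adding swap space might look like this (a hypothetical sketch; the size and file path are illustrative):

```yaml
# Hypothetical tasks for adding a 2 GB swap file to an instance.
- name: Create the swap file
  command: fallocate -l 2G /swapfile
  args:
    creates: /swapfile
  register: swapfile

- name: Set permissions on the swap file
  file:
    path: /swapfile
    mode: "0600"

- name: Format the swap file
  command: mkswap /swapfile
  when: swapfile.changed

- name: Enable swap
  command: swapon /swapfile
  when: swapfile.changed
```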

Setting Up Services On New EC2 Instances

In the previous section, we mentioned how developers launch new instances. These instances are launched from AMIs that the infrastructure team has baked with certain services and boot-up scripts.

We use Ansible playbooks, coupled with open-source roles as well as roles written in-house, to set up multiple utility services on every EC2 instance. These include monitoring, log aggregation and instrumentation agents, NTP, a local mail server for sending internal emails, and common packages (like vim, tmux, build-essential, git, curl, etc.). This playbook also configures the FQDN according to our conventions and user inputs, enables some custom logging, and grants the right set of users access to the instance.
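Structurally, such a base-setup playbook is mostly a list of roles (the role names and domain below are illustrative, not our actual repo layout):

```yaml
# Hypothetical base-setup playbook applied to every new EC2 instance.
- hosts: all
  become: yes
  roles:
    - common-packages    # vim, tmux, build-essential, git, curl, ...
    - ntp
    - monitoring-agent
    - log-aggregation
    - mail-relay         # local mail server for internal emails
  tasks:
    - name: Set the FQDN according to naming conventions
      hostname:
        name: "{{ inventory_hostname }}.example.internal"  # placeholder domain
```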

Deployments

Deployments are very straightforward. Every team is responsible for writing their own Ansible roles and playbooks, and every team maintains its own Ansible repo. Team members review each other’s Ansible pull requests just as they review code. When there is a dispute or advice is needed, the infrastructure team is called in to review. Team-specific Ansible projects/repos are bootstrapped from ansible-core (popularly known as the mother of all Ansible projects at Grofers). ansible-core is responsible for bootstrapping new Ansible projects so that developers don’t have to make decisions about Ansible best practices; all the custom pieces like dynamic inventory setup, common roles, etc. are put in place by ansible-core. We will write more about ansible-core in another blog post.

While we are polyglot at Grofers, we do have a couple of technologies as our mainstream stack. And because we have a lot of similar components, it was possible for us to share knowledge around many infrastructural components. Playbooks and roles written once are usually referred to by other teams as starting points. These playbooks are simple in nature: they usually just clone the repo, install the dependencies, and start the service using supervisord or whatever is relevant. We also use some of Ansible’s orchestration features, like run_once, for things such as Django management tasks that need to be executed only once and not on every machine in the cluster. Thanks to the commonalities in our stack, it was easy for our team to pick up Ansible. Almost every developer at Grofers knows how to execute Ansible playbooks for deployments, and most know how to write one.
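The deployment flow described above can be sketched as follows (a hypothetical playbook; the repo URL, paths, and service name are illustrative):

```yaml
# Hypothetical deployment playbook: clone, install deps, migrate once, restart.
- hosts: app-servers
  become: yes
  tasks:
    - name: Fetch the latest application code
      git:
        repo: git@github.com:example/app.git  # placeholder repo
        dest: /opt/app
        version: master

    - name: Install Python dependencies into a virtualenv
      pip:
        requirements: /opt/app/requirements.txt
        virtualenv: /opt/app/venv

    - name: Run database migrations on a single host only
      django_manage:
        command: migrate
        app_path: /opt/app
        virtualenv: /opt/app/venv
      run_once: yes

    - name: Restart the service under supervisord
      supervisorctl:
        name: app
        state: restarted
```

The run_once flag is what keeps cluster-wide one-time tasks, like migrations, from executing on every machine.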

We also use the same playbooks for auto-scaling with AWS. We have written wrappers around Ansible in our home-baked AMIs that make sure the latest Ansible code is executed when a new machine boots up during auto-scaling. This also reduces the work for developers, as they have to do almost nothing extra to set up auto-scaling. Since auto-scaling is so easy to set up, everyone sees value in it. As of today we have 35+ services behind auto-scaling groups.

Databases

We have multiple kinds of database systems in use at Grofers — PostgreSQL, MySQL, MongoDB, and Redis, to name a few, with Postgres being the most used. We manage most of these databases to different degrees with Ansible. For MongoDB and Redis, we use Ansible only for setting up clusters, but the entire cluster setup is done with it.

We use Elasticsearch extensively as well, and we have written a playbook that not only sets up Elasticsearch but also scales out an existing cluster. This playbook does everything end to end: launching a similar EC2 instance, installing Elasticsearch, disabling shard balancing while the new node is added, adding the newly launched instance to the cluster, increasing replication, and then enabling shard balancing again. With Ansible, even complex setups like this are possible and easy for developers to run.
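The shard-balancing toggle around the scale-out can be sketched with Ansible’s uri module against Elasticsearch’s standard cluster settings API (the es_master variable and the surrounding steps are illustrative):

```yaml
# Hypothetical tasks bracketing a scale-out with shard-allocation toggles.
- name: Disable shard balancing before adding the new node
  uri:
    url: "http://{{ es_master }}:9200/_cluster/settings"
    method: PUT
    body_format: json
    body:
      transient:
        cluster.routing.allocation.enable: "none"

# ... launch the new EC2 instance, install and start Elasticsearch ...

- name: Re-enable shard balancing once the node has joined
  uri:
    url: "http://{{ es_master }}:9200/_cluster/settings"
    method: PUT
    body_format: json
    body:
      transient:
        cluster.routing.allocation.enable: "all"
```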

Our Postgres setup is slightly different. We use RDS on AWS for managing our Postgres clusters. The cluster creation process is manual, as it doesn’t happen too often, and we use the RDS web interface for it. However, all databases and their users are managed using custom Ansible roles: which user has what privileges on which databases is all driven through Ansible. We use Ansible for managing CloudWatch alarms for each RDS instance as well.
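Managing databases and users against RDS might look like this (a hypothetical sketch using Ansible’s postgresql modules; database, user, and variable names are illustrative, and credentials would come from a vault in practice):

```yaml
# Hypothetical tasks for managing databases and users on an RDS instance.
- name: Ensure the reporting database exists
  postgresql_db:
    name: reporting
    login_host: "{{ rds_endpoint }}"
    login_user: admin
    login_password: "{{ admin_password }}"

- name: Ensure the analyst user exists with the right privileges
  postgresql_user:
    name: analyst
    password: "{{ analyst_password }}"
    db: reporting
    priv: "CONNECT"
    login_host: "{{ rds_endpoint }}"
    login_user: admin
    login_password: "{{ admin_password }}"
```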

On-boarding New Team Members

A small but rather painful use-case, and something that needs to be done at every company: allow the user to connect to the VPN, share new AWS IAM credentials, point the user to relevant documentation, and so on. We use Ansible for these redundant tasks that nobody likes to do, making our lives easier. We have automated some of the painful parts of onboarding a new team member and plan to automate more.
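For instance, the IAM-credentials step could be automated with Ansible’s iam module along these lines (a hypothetical sketch; the group and variable names are illustrative):

```yaml
# Hypothetical onboarding task: create an IAM user with a fresh access key.
- name: Create IAM user for the new team member
  iam:
    iam_type: user
    name: "{{ new_user }}"
    state: present
    access_key_state: create
    groups:
      - developers   # placeholder group granting baseline permissions
```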

What’s next?

While we have only touched on the various use-cases of Ansible at Grofers, there is a lot more we would like to discuss and get feedback on: the way we have architected our infrastructure, the tooling we have built in-house, and how Ansible fits into all of this. Watch this space for more articles in this series that will deep-dive into the implementation and architectural details of our infrastructure and our use of Ansible.

If this type of work interests you, we are always looking to work with talented engineers.

Discussion on Hacker News

Follow the discussion on Hacker News.