In my last post, I mentioned that we’re using SaltStack (Salt) without a master. Without a master, how are we bootstrapping our instances? How are we updating the code that’s managing the instances? For this, we’re using Python virtualenvs, S3, autoscaling groups with IAM roles, cloud-init, and an artifact-based deployer that stores artifacts in S3 and pulls them onto the instances. Let’s start with how we’re creating the AWS resources.

Orchestration

We’re using Salt for orchestration. A while ago I wrote some custom code for environment provisioning that started with creating MongoDB databases and Heroku applications and later added management of AWS resources. I spent a few weeks turning our custom code into state and execution modules for Salt. We’re now using those Salt states for orchestration of AWS resources. The ones that appear in the example below are boto_secgroup, boto_iam_role, boto_elb, boto_sqs, boto_asg and boto_cloudwatch_alarm.

Through these states we create all of the resources for a service and environment. Here’s an example of a simple web application:

Ensure myapp security group exists:
  boto_secgroup.present:
    - name: myapp
    - description: myapp security group
    - rules:
        - ip_protocol: tcp
          from_port: 80
          to_port: 80
          source_group_name: amazon-elb-sg
          source_group_owner_id: amazon-elb
    - profile: aws_profile

{% set service_instance = 'testing' %}

Ensure myapp-{{ service_instance }}-useast1 iam role exists:
  boto_iam_role.present:
    - name: myapp-{{ service_instance }}-useast1
    - policies:
        'bootstrap':
          Version: '2012-10-17'
          Statement:
            - Action:
                - 'elasticloadbalancing:DeregisterInstancesFromLoadBalancer'
                - 'elasticloadbalancing:RegisterInstancesWithLoadBalancer'
              Effect: 'Allow'
              Resource: 'arn:aws:elasticloadbalancing:*:*:loadbalancer/myapp-{{ service_instance }}-useast1'
            - Action:
                - 's3:Head*'
                - 's3:Get*'
              Effect: 'Allow'
              Resource:
                - 'arn:aws:s3:::bootstrap/deploy/myapp/*'
            - Action:
                - 's3:List*'
                - 's3:Get*'
              Effect: 'Allow'
              Resource:
                - 'arn:aws:s3:::bootstrap'
              Condition:
                StringLike:
                  's3:prefix':
                    - 'deploy/myapp/*'
            - Action:
                - 'ec2:DescribeTags'
              Effect: 'Allow'
              Resource:
                - '*'
        'myapp-{{ service_instance }}-sqs':
          Version: '2012-10-17'
          Statement:
            - Action:
                - 'sqs:ChangeMessageVisibility'
                - 'sqs:DeleteMessage'
                - 'sqs:GetQueueAttributes'
                - 'sqs:GetQueueUrl'
                - 'sqs:ListQueues'
                - 'sqs:ReceiveMessage'
                - 'sqs:SendMessage'
              Effect: 'Allow'
              Resource:
                - 'arn:aws:sqs:*:*:myapp-{{ service_instance }}-*'
              Sid: 'myapp{{ service_instance }}sqs1'
    - profile: aws_profile

Ensure myapp-{{ service_instance }} security group exists:
  boto_secgroup.present:
    - name: myapp-{{ service_instance }}
    - description: myapp-{{ service_instance }} security group
    - profile: aws_profile

Ensure myapp-{{ service_instance }}-useast1 elb exists:
  boto_elb.present:
    - name: myapp-{{ service_instance }}-useast1
    - availability_zones:
        - us-east-1a
        - us-east-1d
        - us-east-1e
    - listeners:
        - elb_port: 80
          instance_port: 80
          elb_protocol: HTTP
        - elb_port: 443
          instance_port: 80
          elb_protocol: HTTPS
          instance_protocol: HTTP
          certificate: 'arn:aws:iam::12snip34:server-certificate/a-certificate'
    - health_check:
        target: 'HTTP:80/'
    - attributes:
        access_log:
          enabled: true
          s3_bucket_name: 'logs'
          s3_bucket_prefix: 'myapp-{{ service_instance }}-useast1'
          emit_interval: '5'
    - cnames:
        - name: myapp-{{ service_instance }}.example.com.
          zone: example.com.
    - profile: aws_profile

{% for queue in ['example-queue-1', 'example-queue-2'] %}
Ensure myapp-{{ service_instance }}-{{ queue }} sqs queue is present:
  boto_sqs.present:
    - name: myapp-{{ service_instance }}-{{ queue }}
    - profile: aws_profile
{% endfor %}

Ensure myapp-{{ service_instance }}-useast1 asg exists:
  boto_asg.present:
    - name: myapp-{{ service_instance }}-useast1
    - launch_config_name: myapp-{{ service_instance }}-useast1
    - launch_config:
      - image_id: ami-fakeami
      - key_name: example-key
      - security_groups:
        - base
        - myapp
        - myapp-{{ service_instance }}
      - instance_type: c3.large
      - instance_monitoring: true
      - cloud_init:
          scripts:
            salt: |
              #!/bin/bash
              apt-get -y update
              apt-get install -y python-m2crypto python-crypto python-zmq python-pip python-virtualenv python-apt git-core
              wget https://s3.amazonaws.com/bootstrap/salt/bootstrap.tar.gz
              tar -xzvPf bootstrap.tar.gz
              time /srv/pulldeploy/venv/bin/python /srv/pulldeploy/pulldeploy.py myapp {{ service_instance }} -v && salt-call state.sls elb.register
    - availability_zones:
        - us-east-1a
        - us-east-1d
        - us-east-1e
    - suspended_processes:
        - AddToLoadBalancer
    - min_size: 30
    - max_size: 30
    - load_balancers:
        - myapp-{{ service_instance }}-useast1
    - instance_profile_name: myapp-{{ service_instance }}-useast1
    - scaling_policies:
        - name: ScaleDown
          adjustment_type: ChangeInCapacity
          scaling_adjustment: -1
          cooldown: 1800
        - name: ScaleUp
          adjustment_type: ChangeInCapacity
          scaling_adjustment: 5
          cooldown: 1800
    - tags:
        - key: 'Name'
          value: 'myapp-{{ service_instance }}-useast1'
          propagate_at_launch: true
    - profile: aws_profile

autoscale up alarm:
  boto_cloudwatch_alarm.present:
    - name: 'myapp-{{ service_instance }}-useast1-asg-up-CPU-Utilization'
    - attributes:
        metric: CPUUtilization
        namespace: AWS/EC2
        statistic: Average
        comparison: '>='
        threshold: 50.0
        period: 300
        evaluation_periods: 1
        unit: null
        description: ''
        dimensions:
          AutoScalingGroupName:
            - myapp-{{ service_instance }}-useast1
        alarm_actions:
          - 'scaling_policy:myapp-{{ service_instance }}-useast1:ScaleUp'
          - 'arn:aws:sns:us-east-1:12snip34:hipchat-notify'
        insufficient_data_actions: []
        ok_actions: []
    - profile: aws_profile

autoscale down alarm:
  boto_cloudwatch_alarm.present:
    - name: 'myapp-{{ service_instance }}-useast1-asg-down-CPU-Utilization'
    - attributes:
        metric: CPUUtilization
        namespace: AWS/EC2
        statistic: Average
        comparison: '<='
        threshold: 10.0
        period: 300
        evaluation_periods: 1
        unit: null
        description: ''
        dimensions:
          AutoScalingGroupName:
            - myapp-{{ service_instance }}-useast1
        alarm_actions:
          - 'scaling_policy:myapp-{{ service_instance }}-useast1:ScaleDown'
          - 'arn:aws:sns:us-east-1:12snip34:hipchat-notify'
        insufficient_data_actions: []
        ok_actions: []
    - profile: aws_profile

I know this doesn’t look very simple at first, but this configuration is meant for scale. The numbers and instance sizes here are fake and don’t reflect any of our production services; they’re meant as an example, so adjust your configuration to meet your needs. This configuration carries out all of the following actions, in order:

1. Manages a myapp security group, meant for blanket rules for this service (a single rule in this example).
2. Manages an IAM role, with a number of policies.
3. Manages a myapp-{{ service_instance }} security group, meant for testing security group rules or per-service_instance rules.
4. Manages an ELB and the Route53 DNS entries that point at it.
5. Manages two SQS queues.
6. Manages an autoscaling group, its associated launch configuration, and its scaling policies.
7. Manages two CloudWatch alarms that are used for the autoscaling group’s scaling policies.

I say “manages” for all of those resources because changing any of them is simply a matter of modifying the state and then re-running Salt.

From the Salt bootstrapping perspective, #2 and #6 are the key things we’ll be looking at. The IAM role allows the instance to access other AWS resources — in this case, the deploy directory of the bootstrap bucket. The launch configuration portion of the autoscaling group adds a Salt cloud-init script that installs Salt’s dependencies, wgets a tarred relocatable virtualenv for Salt and our deployer, untars it, then runs the deployer.

In the IAM role, autoscaling configuration, and cloud-init we have a special process for managing our ELBs. Our autoscaling groups suspend the AddToLoadBalancer process, so new autoscaled instances aren’t immediately added to the ELB. Instead, the launch configuration has the instance register itself with its own ELB only after a successful initial Salt run. The IAM policy scopes the register and deregister actions to that one ELB, so an instance can only add itself to, or remove itself from, the load balancer for its own service and environment.

Also in the IAM role we grant access only to a service’s particular deployment resources, to limit access across services. We similarly restrict access by the service_instance, where necessary, to restrict access across environments of a service.

Unfortunately AWS doesn’t provide a way to restrict which resources’ tags an instance can describe, which is why the ec2:DescribeTags statement in the policy uses a Resource of '*'. We use autoscaling group tags in the bootstrapping process, which we’ll get to later, when discussing naming conventions.
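To make the scoping concrete, here is a minimal Python sketch that builds the same 'bootstrap' policy document from a service name and service_instance. The function name bootstrap_policy and its region and bucket defaults are assumptions made for illustration; our real policies are written directly in the Salt state shown earlier.

```python
def bootstrap_policy(service, service_instance, region="useast1", bucket="bootstrap"):
    """Build a 'bootstrap' policy document scoped to one service environment.

    Hypothetical helper: it mirrors the statements in the boto_iam_role
    example, with the names derived from the naming convention.
    """
    cluster = "{0}-{1}-{2}".format(service, service_instance, region)
    deploy_prefix = "deploy/{0}/*".format(service)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Only this environment's ELB can be (de)registered against.
                "Action": [
                    "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
                    "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
                ],
                "Effect": "Allow",
                "Resource": "arn:aws:elasticloadbalancing:*:*:loadbalancer/" + cluster,
            },
            {
                # Only this service's deploy artifacts can be read.
                "Action": ["s3:Head*", "s3:Get*"],
                "Effect": "Allow",
                "Resource": ["arn:aws:s3:::{0}/{1}".format(bucket, deploy_prefix)],
            },
            {
                # Listing the bucket is limited to this service's prefix.
                "Action": ["s3:List*", "s3:Get*"],
                "Effect": "Allow",
                "Resource": ["arn:aws:s3:::" + bucket],
                "Condition": {"StringLike": {"s3:prefix": [deploy_prefix]}},
            },
            {
                # Tag lookups cannot be narrowed, so this one stays wide open.
                "Action": ["ec2:DescribeTags"],
                "Effect": "Allow",
                "Resource": ["*"],
            },
        ],
    }
```

Calling bootstrap_policy('myapp', 'testing') yields an ELB resource of arn:aws:elasticloadbalancing:*:*:loadbalancer/myapp-testing-useast1, matching the state above.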

Instance configuration

When the orchestration is run the resources are created and the bootstrapping process for the instances starts. This process starts from the launch configuration as described above, which in short is:

1. Salt’s dependencies are installed.
2. Salt and the deployer are fetched from S3 via wget. This artifact is public since it’s just Salt and deployer code, neither of which is sensitive. We’ve munged the link to avoid third parties using a Salt version they don’t control.
3. The deployer is run, and if successful the instance is registered with its ELB.

To properly bootstrap the system it’s necessary for the system to pull down its required artifacts and to build itself based on its service and environment. The deployer starts this process. Its logic works as follows:

1. Fetch the base and service artifacts.
2. Create a /srv/base/current link that points at base’s current deployment directory.
3. Create a /srv/service link that points at the service’s deployment directory.
4. Create a /srv/service/next link to point at the artifact about to be deployed.
5. Run pre-release hooks from the service repo.
6. Run ‘salt-call state.highstate’.
7. Create a /srv/service/current link to point at the artifact currently deployed.
8. Run post-release hooks from the service repo.
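The link handling in the last five steps can be sketched roughly as follows. This is not the deployer's actual code: point_link and release are hypothetical names, and run_hook and run_highstate are stand-in callables (assumed to return True on success) for the hook runner and for 'salt-call state.highstate'. Fetching the artifacts and creating the /srv/base/current and /srv/service links are left out.

```python
import os


def point_link(link_path, target):
    """Point link_path at target by renaming a temporary symlink over it."""
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target, tmp_path)
    # rename() replaces an existing symlink in one step, so readers never
    # observe a missing link.
    os.replace(tmp_path, link_path)


def release(service_root, artifact_dir, run_hook, run_highstate):
    """Release an already-fetched artifact; return True if every step passed.

    run_hook(name, artifact_dir) and run_highstate() are hypothetical
    callables supplied by the caller.
    """
    next_link = os.path.join(service_root, "next")
    current_link = os.path.join(service_root, "current")

    point_link(next_link, artifact_dir)
    if not run_hook("pre-release", artifact_dir):
        return False
    # Salt reads its states from <service_root>/next, so the new artifact's
    # states are applied before "current" moves to the new code.
    if not run_highstate():
        return False
    point_link(current_link, artifact_dir)
    return run_hook("post-release", artifact_dir)
```

If the highstate fails, "current" still points at the previously deployed artifact, which is the property the ordering is meant to provide.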

We have a standard Salt configuration for all services, which is why we create a /srv/service link. Salt can always point to that location. Specifically, we point at /srv/service/next. In the above logic we run highstate between the creation of the next and current links. By doing so we can deploy a change that relies on a system dependency. Here’s our Salt minion configuration:

# For development purposes, always fail if any state fails. This makes it much
# easier to ensure first-runs will succeed.
failhard: True

# Show terse output for successful states and full output for failures.
state_output: mixed

# Only show changes
state_verbose: False

# Show basic information about what Salt is doing during its highstate. Set
# this to critical to disable logging output.
log_level: info

# Never try to connect to a master.
file_client: local
local: True

# Path to the states, files and templates.
file_roots:
  base:
    - /srv/service/next/salt/config/states
    - /srv/base/current/states

# Path to pillar variables.
pillar_roots:
  base:
    - /srv/service/next/salt/config/pillar
    - /srv/base/current/pillar

# Path to custom grain modules
grains_dirs:
  - /srv/base/current/grains

The deployer only handles getting the code artifacts onto the system and running hooks and Salt. Salt itself determines how the system will be configured based on the service and its environment.

We use Salt’s state and pillar top systems with grains to determine how a system will configure itself. Before we go into the top files, though, let’s explain the grains being used.

Standardized resource naming and grains

We name our resources by convention. This allows us to greatly simplify our code, since we can use this convention to refer to resources in orchestration, bootstrapping, IAM policy, and configuration management. The naming convention for our instances is:

service-service_instance-region-service_node.example.com

An example would be:

myapp-testing-useast1-898900.example.com

This hostname is based on the autoscaling group name, which would be:

service-service_instance-region

Or, in this example:

myapp-testing-useast1

During the bootstrapping process, when Salt runs, a custom grain fetches the instance’s autoscaling group name and instance ID, parses them, and returns a number of grains:

service_name (myapp)

service_instance (testing)

service_node (898900)

region (useast1)

cluster_name (myapp-testing-useast1)

service_group (myapp-testing)

At the beginning of the Salt run, Salt ensures a hostname is set, based on the grains. The custom grain first checks whether a hostname matching our naming convention is already set. If so, it parses the hostname and returns grains based on that instead of calling AWS. We do this to avoid unnecessary boto calls on future Salt runs. Another reason we set a hostname is so that we can use it in monitoring and reporting, to get a human-friendly name for instances.
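As an illustration of the hostname branch of that grain, here is a minimal Python sketch. The function name, the regular expression, and the assumption that service names and service instances never contain hyphens are mine. The real grain also falls back to looking up the autoscaling group name and instance ID when the hostname doesn't match, which is not shown here.

```python
import re

# Assumed pattern for service-service_instance-region-service_node[.domain].
# It presumes none of the four parts contains a hyphen or a dot.
HOSTNAME_RE = re.compile(
    r"^(?P<service_name>[^-.]+)"
    r"-(?P<service_instance>[^-.]+)"
    r"-(?P<region>[^-.]+)"
    r"-(?P<service_node>[^-.]+)"
    r"(?:\..*)?$"
)


def grains_from_hostname(hostname):
    """Return the naming-convention grains, or None if hostname doesn't match."""
    match = HOSTNAME_RE.match(hostname)
    if match is None:
        return None
    grains = match.groupdict()
    grains["cluster_name"] = "{service_name}-{service_instance}-{region}".format(**grains)
    grains["service_group"] = "{service_name}-{service_instance}".format(**grains)
    return grains
```

For myapp-testing-useast1-898900.example.com this returns the six grains listed above. A default EC2 hostname such as ip-10-0-0-12.ec2.internal doesn't match, which is where the AWS lookup would take over.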

Now, let’s go back into the top files, based on this info.

Top files using grain matching

Here’s an example pillar top file:

base:
  '*':
    - base
    - myapp
    - order: 1
{% for root in opts['pillar_roots']['base'] -%}
{% set service_group_sls = '{0}/{1}.sls'.format(root, grains['service_group']) -%}
{% if salt['file.file_exists'](service_group_sls) %}
  'service_group:{{ grains["service_group"] }}':
    - match: grain
    - {{ grains['service_group'] }}
    - order: 10
{% endif %}
{% endfor -%}

The Jinja used here is to include a file if it exists and to ignore it otherwise. By doing this we can avoid editing the top file for most common cases. If a new environment is added, then a developer only needs to add the myapp-new_environment.sls file, and it’ll be automatically included.
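The same include-if-present logic, written as plain Python for readers less familiar with Jinja. This is only a sketch of what the template computes, not something Salt runs: pillar_top_entries is a hypothetical name, and the fixed 'base' and 'myapp' entries are taken from the example top file.

```python
import os


def pillar_top_entries(pillar_roots, service_group, always=("base", "myapp")):
    """List the pillar SLS names the top file would include, in order."""
    entries = list(always)
    # Mirror the Jinja loop: include <service_group>.sls only if some pillar
    # root actually contains that file.
    found = any(
        os.path.isfile(os.path.join(root, service_group + ".sls"))
        for root in pillar_roots
    )
    if found:
        entries.append(service_group)
    return entries
```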

We have these included in a specific order, since conflicting keys are overridden in inclusion order. We include them in order of most generic to most specific. In this case, for an instance with myapp-testing as its service group, it’ll include base, then myapp, then myapp-testing pillar files.

For instance, if we were only enabling monitoring in the testing environment, we could set a boolean pillar like so:

myapp.sls:

enable_monitoring: False

myapp-testing.sls:

enable_monitoring: True
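The effect of that ordering can be shown with a small Python sketch that layers dictionaries from most generic to most specific. merge_pillars is a hypothetical helper with a simplified recursive merge, not Salt's own merging code. It is only meant to show why the myapp-testing value wins.

```python
def merge_pillars(*layers):
    """Merge pillar dicts in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Nested dictionaries are merged key by key.
                merged[key] = merge_pillars(merged[key], value)
            else:
                # Scalars, lists and new keys simply replace what was there.
                merged[key] = value
    return merged


myapp = {"enable_monitoring": False}          # myapp.sls
myapp_testing = {"enable_monitoring": True}   # myapp-testing.sls

testing_pillar = merge_pillars(myapp, myapp_testing)
other_pillar = merge_pillars(myapp)
```

Here testing_pillar['enable_monitoring'] is True, while an instance in any other environment only sees the myapp default of False.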

This allows us to set generic defaults and override them where needed, so that we can use as few pillars as possible.

Here’s an example of our states top file:

base:
  '*':
    - base
    - order: 1
  'service_name:myapp':
    - match: grain
    - order: 10
    - myapp
  'service_name:myappbatch':
    - match: grain
    - order: 10
    - myapp

In this file we’re including base, then we’re including a service-specific state file. In this specific case both myapp and myappbatch are so similar that we’re including the same state file. For this case our differences are handled by pillars, rather than splitting the code apart.

Deployment

Notice that the bootstrapping is written in such a way that it’s simply doing an initial deployment. All further deployments use the same pattern. Salt is an essential part of our deployment process. It’s run on every single deployment. If a deployment is simply a code change with no Salt changes, the run is incredibly fast, since salt-call returns no-change runs in around 12 seconds. Since we’re always deploying base changes with any deploy, we also have a mechanism to update the base repository and make Salt changes on every system immediately.