An automated AWS service failing due to a missing IAM permission can have surprising causes. Examining the CloudTrail record that the failed call leaves behind is a quick way to pinpoint any missing or incorrect permissions.

The problem

We recently found ourselves debugging an IAM permission set in the context of launching EMR clusters.

Launching a cluster requires an IAM role with an extensive set of permissions — it needs to be able to launch the instances, possibly create security groups, create SQS queues, and much more.

The default role AWS provides covers all these and much more, including ec2:TerminateInstances and sqs:Delete* on any resource (*) — take a look at aws emr create-default-roles help for a complete list!
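The shape of the relevant statements in that default policy is roughly the following (abridged and illustrative — only the two actions called out above; consult the help output for the real thing):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:TerminateInstances",
        "sqs:Delete*"
      ]
    }
  ]
}
```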

To avoid running automated clusters under such a powerful [and potentially damaging] role alongside the rest of the infrastructure, we initially ran them in a separate AWS account — which brought a number of different permission issues: buckets shared between accounts, object read permissions, and so on.

Eventually we decided to run the EMR clusters in our production account, but we needed a more restricted IAM role first — in particular, we needed to limit permissions based on resources. While the docs cover which permissions are needed to run instances, and the actual permissions used by the launched EMR clusters can be deduced from the launch parameters, problems arise if the resources don’t match exactly how the EMR implementation issues the calls — the cluster fails to launch with a rather spartan permission-denied message.

A solution for debugging the role is CloudTrail: by executing the process and investigating its trace, we can [iteratively] construct such a role.

Limiting the role

Our first idea was to set limits with the --tags parameter of the aws(1) emr create-cluster command — since all resources created by EMR are tagged with these tags, it should be possible to give the EMR role permission to create/destroy only resources tagged that way. According to the resource-level permission docs, the tags are passed to every command used to launch the EC2 instances.

So we went ahead and granted RunInstances (among others) to the EMR service role, limiting it to resources carrying an arbitrary tag of ours: EMRTag=EMRValue. This would also ensure that we could track every resource created by EMR by filtering on that tag.
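The statement we attempted looked roughly like this (a sketch, not our verbatim policy; ec2:ResourceTag/<key> is the standard EC2 condition key for matching resource tags):

```json
{
  "Effect": "Allow",
  "Action": "ec2:RunInstances",
  "Resource": "*",
  "Condition": {
    "StringEquals": { "ec2:ResourceTag/EMRTag": "EMRValue" }
  }
}
```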

However, the clusters wouldn’t launch, failing with the not-too-informative message “the EMR service role hasn’t enough permissions”. Here’s where investigating what’s going on in CloudTrail comes in handy.

Setting up the new trail

CloudTrail uploads the log to an S3 bucket, and can optionally use an SNS topic as well. create-subscription can create the bucket for us, set up its policy, and start the logging process in a single command:

aws cloudtrail create-subscription --name SampleTrail --s3-new-bucket sample-bucket

With this done, we can now attempt to kick off EMR and wait for CloudTrail to upload the logs for the run to the bucket.

Examining the logs

Once the EMR launch has been executed and the log has arrived in the S3 bucket, we can begin analyzing it. There are many tools for fetching, displaying and filtering the logs — largely a matter of choice. In this post we’ll focus on a quite universal one — the Unix command line.

CloudTrail stores the logs in a one-subdirectory-per-day fashion, so if you don’t feel like selecting the appropriate period (and the traffic is not that high) you can download today’s directory using the aws s3 sync command:

aws s3 sync s3://sample-bucket/AWSLogs/012345678990/CloudTrail/us-east-1/2017/09/07/ logs

where logs is a local dir to sync to.

The logs are stored compressed, and since uncompressing them can take up a large amount of space even with moderate traffic, it’s good practice to work with them in their compressed form — this usually means zcat(1), zgrep(1), zless(1), etc., instead of their z-less counterparts [1].
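As a self-contained illustration (using a fabricated one-record log, not a real CloudTrail file):

```shell
# Fabricate a tiny gzipped log so the compressed-workflow commands can be tried out.
printf '{"Records":[{"eventName":"RunInstances","errorCode":"AccessDenied"}]}' \
    | gzip > sample.json.gz

# zgrep(1) searches the compressed file directly — no intermediate decompression needed.
zgrep -c AccessDenied sample.json.gz   # prints 1
```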

Grepping JSON

The logs are JSON, so jq(1) [2] is an excellent tool for examining them. You can always [z]less(1) a file as a quick reminder of the event format (jq(1) doesn’t work on compressed files, so be sure to pipe them through zcat(1) if needed):

zcat 012345678990_CloudTrail_us-east-1_20170831T1510Z_Oiv3a7oQ66XZHfaJ.json.gz | jq . | less

where jq . is a no-op filter that acts as a pretty-printer.

We now filter for the events we’re looking for — instance creation (RunInstances) by EMR (which has its own user agent) that returned an error code of AccessDenied:

zcat * | jq '.Records[] | select (.eventName == "RunInstances" ) | select(.userAgent == "elasticmapreduce.aws.internal") | select(.errorCode == "AccessDenied")'

getting the matching event record.
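The record looks roughly like this (heavily abridged and illustrative — the values are placeholders, not a verbatim log entry):

```json
{
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "userAgent": "elasticmapreduce.aws.internal",
  "errorCode": "AccessDenied",
  "errorMessage": "You are not authorized to perform this operation.",
  "requestParameters": {
    "instancesSet": { "items": [ { "minCount": 1, "maxCount": 1 } ] }
  }
}
```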

A look at the event quickly reveals the issue — no tags in sight! To learn what happened with the tags, we go back to zcat * | jq .Records[] | less and search for EMRTag (type /EMRTag in less(1)), which is the tag we ran the create-cluster command with. We then find a CreateTags event.
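That CreateTags event looks roughly like this (abridged and illustrative — the resource id is a placeholder):

```json
{
  "eventSource": "ec2.amazonaws.com",
  "eventName": "CreateTags",
  "userAgent": "elasticmapreduce.aws.internal",
  "requestParameters": {
    "resourcesSet": { "items": [ { "resourceId": "i-0123456789abcdef0" } ] },
    "tagSet": { "items": [ { "key": "EMRTag", "value": "EMRValue" } ] }
  }
}
```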

From it we conclude that tags aren’t created on instance creation, but later as a separate event — which dooms our idea of limiting the RunInstances API call by tag.

However, the event information gives us some ideas on how to limit the permissions. We decided on limiting by security group (RunInstances needs permissions on the security groups it’s going to put the instances in), since these groups are exclusively for EMR — so it can launch as many instances as it needs, as long as it places them in those groups.
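A sketch of the resulting statement (the ARN’s region, account id and group id are placeholders; RunInstances also touches other resource types — images, subnets, volumes and so on — elided here):

```json
{
  "Effect": "Allow",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:us-east-1:012345678990:security-group/sg-0123456789abcdef0"
}
```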

A further example

As another example: after a while we found out that although EMR wasn’t erroring out on launch anymore, each run generated some failed CreateQueue SQS events. We quickly searched for the matching events:

zcat * | jq '.Records[] | select(.eventName == "CreateQueue") | select(.errorCode == "AccessDenied")' | less

getting the failing events.
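One of them, roughly (abridged and illustrative — the queue-name suffix is a placeholder):

```json
{
  "eventSource": "sqs.amazonaws.com",
  "eventName": "CreateQueue",
  "userAgent": "elasticmapreduce.aws.internal",
  "errorCode": "AccessDenied",
  "requestParameters": {
    "queueName": "AWS-ElasticMapReduce-j-0123456789ABC"
  }
}
```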

Those are queues that EMR creates and drops when the cluster is launched with --enable-debugging — the failure isn’t fatal, though. Again, the event gives us an idea of how to grant the proper permissions — sqs:* on any queue named AWS-ElasticMapReduce-*, reserving those names for EMR.
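The corresponding statement, sketched (region and account id are placeholders):

```json
{
  "Effect": "Allow",
  "Action": "sqs:*",
  "Resource": "arn:aws:sqs:us-east-1:012345678990:AWS-ElasticMapReduce-*"
}
```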

We’ve then managed to reduce a wide-open permission (not limited by resource) to a permission over a specific set of resources (limited by SQS queue-name prefix).

After the change, we can search for the successful events (careful with the filter ordering in this one: jq’s contains errors out on null, so the eventName filter needs to run first to weed out records without a requestParameters.queueName):

zcat * | jq '.Records[] | select(.eventName == "CreateQueue") | select(.errorCode == "AccessDenied" | not) | select(.requestParameters.queueName | contains("AWS-ElasticMapReduce"))' | less

and confirm we got it right.
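The same filter can be exercised locally on a fabricated two-record log (an illustrative sketch, assuming jq(1) is installed):

```shell
# Two fabricated CreateQueue records: one denied, one successful.
printf '%s' '{"Records":[
  {"eventName":"CreateQueue","errorCode":"AccessDenied",
   "requestParameters":{"queueName":"AWS-ElasticMapReduce-j-ABC"}},
  {"eventName":"CreateQueue",
   "requestParameters":{"queueName":"AWS-ElasticMapReduce-j-ABC"}}]}' \
  | jq -r '.Records[]
           | select(.eventName == "CreateQueue")
           | select(.errorCode == "AccessDenied" | not)
           | select(.requestParameters.queueName | contains("AWS-ElasticMapReduce"))
           | .requestParameters.queueName'
# prints AWS-ElasticMapReduce-j-ABC — only the successful record survives
```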

A tool for weeding out unnecessary permissions

This raises the question — given a role, is it possible to restrict its permissions as much as possible in a (mostly) automated way?

We’ve been pondering the idea of a tool that could launch a process and watch for its effects on CloudTrail and Access Advisor — even if the process isn’t fully automated, it could provide a listing of the exact resources accessed, making it easier for the operator to extract the minimal set of permissions from it.

A tool that comes close to this is Netflix’s Aardvark, which works as an aggregator and front-end for Access Advisor.

Conclusion

In the event of an error due to a missing IAM permission, we can find all the information needed for debugging in the CloudTrail log the failed event leaves behind.

Since high AWS traffic can quickly produce a large log, we want tools for filtering out the information we’re looking for. In this post, we’ve demonstrated how to filter through the JSON logs by combining two CLI tools (aws(1) and jq(1)); but any log aggregator with filtering capabilities can do the job as well.

[1] A notable exception is ack(1), which can’t work on compressed files.

[2] https://stedolan.github.io/jq/