Athena and CloudTrail: A Marriage made in the Cloud

One of the first things which came to mind when AWS announced AWS Athena at re:Invent 2016 was querying CloudTrail logs. Over the course of the past month, I have had intended to set this up, but current needs dictated I had to do it quickly. When I went looking at JSON imports for Hive/Presto, I was quite confused. Of course, as a trusty technologist I went to Google. Much to my surprise, no one had published an article about using Athena to do this, I was only able to locate EMR based posts which used a custom serde to support the nested CloudTrail format.

I had mild success at first, but thanks to some Athena guru’s, I was able to get the magic piece in place.

I have to provide credit to AWS for their help with a few issues and amazing documentation on the event types.

I have provided references at the end of each field section and the end of the post with specific and broader details for the event fields and their uses.

To set all of this up, you first must have your CloudTrail logs in a single S3 bucket, this will work with a single account or many, but I purposely set up delivery to a single bucket but I created a table per source in Athena under a common database.

This is an example create table which will provide the table/field sytax formats I used in the tables below.

CREATE EXTERNAL TABLE my_table_name ( Records ARRAY< STRUCT< eventName: STRING, requestParameters: STRUCT< instancesSet: STRUCT< items: ARRAY< STRUCT< instanceId: STRING >>>, volumeSet: STRUCT< items: ARRAY< STRUCT< volumeId: STRING > > > >, eventType: STRING, eventSource: STRING, sourceIPAddress: STRING, userIdentity: STRUCT< arn: STRING, principalId: STRING, accountId: STRING, invokedBy: STRING, TYPE: STRING, sessionContext: STRUCT< sessionIssuer: STRUCT< arn: STRING, principalId: STRING, accountId: STRING, TYPE: STRING, userName: STRING >, attributes: STRUCT< creationDate: STRING, mfaAuthenticated: STRING > > >, eventVersion: STRING, responseElements: STRUCT< credentials: STRUCT< accessKeyId: STRING, expiration: STRING, sessionToken: STRING >, assumedRoleUser: STRUCT< arn: STRING, assumedRoleId: STRING > >, userAgent: STRING, eventID: STRING, awsRegion: STRING, sharedEventID: STRING, eventTime: STRING, resources: ARRAY< STRUCT< accountId: STRING, TYPE: STRING, ARN: STRING > >, requestID: STRING, recipientAccountId: STRING >> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH serdeproperties( 'ignore.malformed.json' = 'true' ) LOCATION 's3://my_consolidated_bucket/my-cross-account-prefix/AWSLogs/'

CloudTrail Record Query Columns

These are the columns you can reference in your queries, I have grouped them by purpose. This is not a full list of all CloudTrail fields, so if you need others such as VpcEndpoint, you should add that to the schema.

Event ID Fields

record.eventID GUID generated by CloudTrail to uniquely identify each event record.sharedEventID GUID generated by CloudTrail to uniquely identify CloudTrail events from the same AWS action that is sent to different AWS accounts

Event Details

record.eventName The requested action, which is one of the actions in the API for that service. (example: DescribeLoadBalancers) record.eventSource The service that the request was made to (e.g. ec2.amazonaws.com) record.eventTime The date and time the request was made, in coordinated universal time (UTC) record.eventType Identifies the type of event that generated the event record, one of AwsApiCall, ConsoleSignin, AwsServiceEvent (related to the trail itself, this can occur when another account made a call with a resource that you own) record.eventVersion The version of the log event format record.sourceIPAddress The IP address that the request was made from, when console is used, it will report console.amazonaws.com

Request Details

record.requestId The value that identifies the request, generated by the service being called record.requestParameters The parameters, if any, that were sent with the request

Resource Details

record.resources An array of the resources accessed in the event, used most often by STS or KMS record.resources.accountId The account ID of the impacted element

Response Details

record.responseElements.assumedRoleUser.arn The arn of the assumed role for the unique session record.responseElements.assumedRoleUser.assumedRoleId The ID of the assumed role for the unique session record.responseElements.credentials.accessKeyId The access key of the caller record.responseElements.credentials.expiration The expiration of the current session record.responseElements.credentials.sessionToken The active token for the session References

References

http://docs.aws.amazon.com/IAM/latest/UserGuide/cloudtrail-integration.html#stscloudtrailexample http://docs.aws.amazon.com/kms/latest/developerguide/logging-using-cloudtrail.html

Miscellaneous

record.userAgent The agent through which the request was made record.recipientAccountId Represents the account ID that received this event, may differ from the calling account if cross-account access occurred and will differ on the "remote" end

User Identity

record.userIdentity.accountId The account that owns the entity that granted permissions for the request record.userIdentity.arn The Amazon Resource Name (ARN) of the principal that made the call record.userIdentity.invokedBy The name of the AWS service if that made the request record.userIdentity.principalId A unique identifier for the entity that made the call. For requests made with temporary security credentials, this value includes the session name that is passed to the AssumeRole, AssumeRoleWithWebIdentity, or GetFederationToken API call record.userIdentity.sessionContext.attributes.creationDate The date and time when the temporary security credentials were issued record.userIdentity.sessionContext.attributes.mfaAuthenticated The value is true if the root user or IAM user whose credentials were used for the request also was authenticated with an MFA device; otherwise, false record.userIdentity.sessionContext.sessionIssuer.accountId The account that owns the entity that was used to get credentials record.userIdentity.sessionContext.sessionIssuer.arn The internal ID of the entity that was used to get credentials record.userIdentity.sessionContext.sessionIssuer.type The source of the temporary security credentials, such as Root, IAMUser, or Role record.userIdentity.sessionContext.sessionIssuer.userName The friendly name of the user or role that issued the session. The value that appears depends on the sessionIssuer identity type. See reference material for more information record.userIdentity.type The type of the identity which is one of: Root, IAMUser, AssumedRole, FederatedIsr AWSAccount (cross-account access), AWSService (Access performed by an AWS service such as Elastic Beanstalk)

Reference: http://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-user-identity.html#cloudtrail-event-reference-user-identity-fields

Example Queries

Find all event names by ARN by IP address and count them up as the highest totals

SELECT record.eventName, record.userIdentity.arn, record.sourceIPAddress, COUNT(*) FROM (SELECT record FROM my_table_name CROSS JOIN UNNEST(records) AS t (record)) AS records GROUP BY record.eventName, record.userIdentity.arn, record.sourceIPAddress ORDER BY COUNT(*) DESC LIMIT 20;

Find all events where cross-account access occurred, group them by the source and the ARN and count the totals

SELECT record.eventName, record.eventSource, record.userIdentity.arn, COUNT(*) FROM (SELECT record FROM my_table_name CROSS JOIN UNNEST(records) AS t (record)) AS records WHERE record.recipientAccountId <> record.userIdentity.accountId GROUP BY record.eventName, record.eventSource, record.userIdentity.arn ORDER BY COUNT(*) DESC LIMIT 20;

Document Reference: http://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference.html