AWS CloudTrail tracks the API calls made in an account, and all of these calls are logged and can be analyzed. The output files are typically JSON formatted. I wrote another article here that provides some background on why I needed to do some investigative work 🙂 on these files to identify an offending application. In short, AWS deprecated some API calls, and any application that made them had to be migrated.

These logs had multiple JSON objects on a single line! And the more API calls there are, the more data and the more output files. In one case I had more than 5,000 files generated over a couple of days. At the client site I couldn't get much help with the code base, since it was old Java code written by an outsourced company that had moved on.

So it became an exercise for me in using Apache Drill for this scenario. First I took a single file and ran a Drill query:

0: jdbc:drill:zk=local> select T.jRec.eventSource, T.jRec.eventName, T.jRec.awsRegion, T.jRec.sourceIPAddress, count(*) from (select FLATTEN(Records) jRec from dfs.`/cloudtrail_logs/144702NNNNNN_CloudTrail_us-east-1_20160711T2345Z_CJPTqBCGPPc1Bhqc.json`) T group by T.jRec.eventSource, T.jRec.eventName, T.jRec.awsRegion, T.jRec.sourceIPAddress order by EXPR$1;

Note the “DescribeJobFlows” call, the API call of interest to me; in the image below it is 4th from the top under the column “EXPR$1”.

It is so cool to run a similar query on multiple files with simple wildcards! The query parsed more than 4,000 files in a little over 30 seconds on a single node under high load! The result: 1958 deprecated calls made over a couple of days.
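To make explicit what the FLATTEN and GROUP BY steps compute, here is a minimal Python sketch of the same aggregation. It is not how Drill works internally, just a small cross-check under the assumption that each file is a JSON document with a top-level `Records` array (the array the query flattens). The helper name `count_calls` is hypothetical.

```python
import json
from collections import Counter


def count_calls(file_texts):
    """Count events per (eventSource, eventName, awsRegion, sourceIPAddress).

    file_texts: an iterable of strings, each the full text of one log file.
    Assumes every file has a top-level "Records" array.
    """
    counts = Counter()
    for text in file_texts:
        # Equivalent of FLATTEN(Records): visit every event in the array.
        for rec in json.loads(text).get("Records", []):
            key = (
                rec.get("eventSource"),
                rec.get("eventName"),
                rec.get("awsRegion"),
                rec.get("sourceIPAddress"),
            )
            # Equivalent of GROUP BY ... count(*).
            counts[key] += 1
    return counts
```

To run it over a directory, one could read each file matched by `glob.glob` into a string and pass the strings to `count_calls`, then filter the result on the event name of interest. For thousands of files, Drill's single wildcard query is still the more convenient route.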

Once I knew the region, host IP, and application, it was easy to nail down the shell script that was kicking off EMR instances that used the old jar code.

In this case the analysis was much easier than it would have been even with Spark. For example, Spark by default expects a single JSON record per line in its input files, so this data needs some preprocessing before it can be fed in.
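The kind of preprocessing Spark would need can be sketched in a few lines of Python: unwrap the top-level `Records` array and write each event as its own line. This is only an illustration, assuming each file is a single JSON document with a `Records` array; the function name `records_to_json_lines` is hypothetical.

```python
import json


def records_to_json_lines(cloudtrail_text):
    """Convert one log file's text into one-JSON-object-per-line text.

    Assumes the file is a JSON document with a top-level "Records" array.
    """
    doc = json.loads(cloudtrail_text)
    records = doc.get("Records", [])
    # Compact separators keep each record on exactly one line.
    return "\n".join(json.dumps(rec, separators=(",", ":")) for rec in records)


sample = '{"Records": [{"eventName": "GetBucketAcl"}, {"eventName": "DescribeJobFlows"}]}'
print(records_to_json_lines(sample))
```

Each output line is then a standalone JSON record, which is the shape a line-oriented JSON reader expects. With Drill, this extra pass over the 4,000+ files was not needed.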

Note: the JSON structure of a single CloudTrail record is shown below.

{
  "eventVersion": "1.03",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "A----------------G",
    "arn": "arn:aws:iam::1447NNNNNNNN:user/udxx_prox",
    "accountId": "1447NNNNNNNNN",
    "accessKeyId": "A-----------------A",
    "sessionContext": { "attributes": {}, "sessionIssuer": {} },
    "userName": "udxx_prox"
  },
  "eventTime": "2016-07-11T23:49:40Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "GetBucketAcl",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "AWS Internal",
  "userAgent": "[aws-internal/3]",
  "requestParameters": {
    "instanceGroupTypes": [],
    "instanceIdentity": {},
    "bucketName": "udms-prod",
    "objectIds": []
  },
  "requestID": "74AAE30BXXXXXXXX",
  "eventID": "ec769df9-833f-4f1d-90cd-830ff9b9ff43",
  "eventType": "AwsApiCall",
  "recipientAccountId": "144NNNNNNNN",
  "responseElements": { "clusters": [] },
  "additionalEventData": { "vpcEndpointId": "vpce-2a2ed643" }
}