AWS Athena is a schema on read platform. When it was introduced, there are many restrictions. But now you can use Athena for your production Data Lake solutions. Its using Presto clusters in the backend. Right now it supports external tables only which means you can create the tables on top of the flat files which are stored in S3. One more strong reason for suggesting Athena is its a Serverless service from AWS.

Few recent updates from Athena:

You can run 20 concurrent queries.

Auto complete queries on the Athena console.

Automatically creates the tables for CloudTrail log from the Cloudtrail console.

Offload intermediate data to disk for memory intensive queries.

Presto 0.172 functions, Operations and Lambda expressions are supports.

For more updates about Athena hit here.

Partitions:

AWS strongly recommends to use partitions on a data set. Querying the data on a huge data set without partition will take more time to execute and it’ll scan a lot amounts of data. This will lead to addition cost in your billing. Partitions will make your queries run faster.

CloudTrail is generating vast amount of data and store it in S3. Lets assume if we have 5 years of data and we need to know some information from past 2 months then it’ll take upto 30mins, also it’ll scan TeraBytes of data to find the results. In this case we need to partition the CloudTrail logs till today.

The Challenge:

Its fine to create the partitions on existing data. But the logs will be coming everyday. So somehow we need to create the partitions for each and every day. We can use AWS Cli, SDK or Lambda to automate this process. But we need to make sure the job that we scheduled should not be in single point of failure. So we used Lambda to automate this.

Pre-Requirements:

Needs to create the partitions for till today. And the lambda will start creating the partitions by current date +1 (create partition for tomorrow’s date).

IAM role for read cloudtrail data, write Athena results into S3 and Create, Execute permission for Athena queries.

Our environment for this PoC:

CloudTrail logs Bucket: cloudtrail-logs

cloudtrail-logs Athena query results bucket: aws-athena-query-results-XXXXXXXXXXXXXX-us-east-1

aws-athena-query-results-XXXXXXXXXXXXXX-us-east-1 Athena database: athena_log_database

athena_log_database CloudTrail logs table: cloudtrail_logs_table

IAM Policy for Lambda Role:

Create an IAM role and attach the below inline policy.

{

"Version": "2012-10-17",

"Statement": [

{

"Sid": "VisualEditor0",

"Effect": "Allow",

"Action": "s3:ListBucket",

"Resource": [

"arn:aws:s3:::aws-athena-query-results-XXXXXXXXXX-us-east-1",

"arn:aws:s3:::sqladmin-cloudtrail"

]

},

{

"Sid": "VisualEditor1",

"Effect": "Allow",

"Action": "s3:PutObject",

"Resource": "arn:aws:s3:::aws-athena-query-results-XXXXXXXXXXXXXXXX-us-east-1/*"

},

{

"Sid": "VisualEditor2",

"Effect": "Allow",

"Action": [

"s3:GetObjectAcl",

"s3:GetObject",

"s3:GetObjectTagging",

"s3:GetBucketPolicy"

],

"Resource": [

"arn:aws:s3:::cloudtrail-logs",

"arn:aws:s3:::cloudtrail-logs/*"

]

},

{

"Sid": "VisualEditor3",

"Effect": "Allow",

"Action": [

"athena:StartQueryExecution",

"athena:CreateNamedQuery",

"athena:RunQuery"

],

"Resource": "*"

}

]

}

Create the Athena table for CloudTrail log:

Create the Lambda Function:

Go to AWS Lambda console and create a new Python 3.6 function.

Select IAM Role which we created in the previous step.

Allocate 128MB Memory for this functions.

And set 30secs timeout.

Parameters needs to change in the lambda function:

s3_buckcet — Bucket name where your cloudtrail logs stored. s3_prefix — Path for your cloudtrail logs (give the prefix before the regions. For eg: s3://bucket/AWSLogs/AccountID/Cloudtrail/regions/year/month/day/log_files. So you need to use path: AWSLogs/AccountID/Cloudtrail/ ). s3_ouput — Path for where your Athena query results need to be saved. database — Name of the DB where your cloudtrail logs table located. table_name — Name of the table where your cloudtrail logs table located.

Lambda function code:

Debugging:

Since this function will not tell you whether the Athena queries are successfully executed or not. It just print the query IDs. If you want to debug this function,

Comment the line 103 [run_query(query, database, s3_ouput].

Remove comment from line 101 and 102 [print(get-regions), print(query)].

Happy querying! :)