
If you happen to store structured data on AWS S3, chances are you already use AWS Athena. It is a hosted version of Facebook’s PrestoDB and provides a way of querying structured data (stored in, say, .csv or .json files) using standard ANSI SQL. Given its competitive pricing (5 USD per 1 TB of scanned data), it currently seems to be the best tool for digging through data kept in “cold storage” in S3 (as opposed to “hot storage” in, for instance, Amazon Redshift or some other analytical or transactional database).

Athena has some very nice features. It not only provides the query results in the AWS Console or as a response to an API request, but it also automatically saves them as a .csv file in S3. There is only one small caveat: although you can specify the S3 path, the actual filename will be a random Universally Unique IDentifier (UUID). This is a double-edged sword: on the one hand it prevents overwriting query results by mistake, but on the other it means that Athena alone cannot save query results to a specific location on S3 with a specific file name. Thankfully, Apache Airflow can help us with that.

In this quick post we will see how to:

- Execute an Athena query
- Wait until it successfully completes
- Move the query results to a specific location on S3

All of that can be easily done using community-supplied Airflow operators, very little code and some knowledge of Airflow, most of which you can find in the following sections.

Note: The following sections assume you already have Airflow installed and set up. If you do not, it may be a good idea to do that first, for instance by following the steps outlined in this blog post.

Executing Athena queries in Airflow

One of the great things about Apache Airflow is that just as in any other framework, the most common use cases are already covered, tested and ready to be used. Executing AWS Athena queries is no exception, as the newer versions of Airflow (at least 1.10.3 and onward) come pre-installed with a specific operator that covers this use case. You can easily spot it in the big list of Airflow operators by its “uninspiring” name: AWSAthenaOperator.

The AWSAthenaOperator only requires us to specify three parameters:

- query: the Presto query to run
- database: the database to run the query against
- output_location: the S3 path to write the query results into

Armed with that information, we can put together a small Directed Acyclic Graph (DAG), using a few Airflow modules. With a bit of imagination it may look as follows:
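A minimal sketch of such a DAG might look like this (the database name my_database is a placeholder for your own Athena database; the output location matches the s3://my-bucket/my-path/ path used later in this post):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator

# A DAG with no schedule interval: it only runs when triggered manually.
dag = DAG(
    dag_id="simple_athena_query",
    start_date=datetime(2019, 5, 1),
    schedule_interval=None,
)

# A single task that runs the query and saves the results to S3.
run_query = AWSAthenaOperator(
    task_id="run_query",
    query="SELECT * FROM UNNEST(SEQUENCE(0, 100))",
    database="my_database",  # placeholder: your Athena database
    output_location="s3://my-bucket/my-path/",
    dag=dag,
)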

A very simple DAG with a single task in it: an execution of an AWS Athena query.

So what exactly is happening in this DAG? As the DAG definition shows, it is really minimal: all it specifies are the start date and the schedule interval (set to None, meaning this DAG will not be scheduled). The query parameter is a bit more interesting: it holds the Athena/Presto query we would like to run. In our case that would be

SELECT * FROM UNNEST(SEQUENCE(0, 100))

It may look quite scary at first sight, but it really is not: it just lists the numbers from 0 to 100 using Presto’s sequence function.

There really is not much else happening here, so provided we specified the database and output_location parameters correctly, we can take this DAG, and specifically its run_query task, for a test run. In order to do so, however, Airflow needs to have a connection to AWS set up properly.

Intermezzo: Setting up connection to AWS with Airflow

To connect to AWS services from within Airflow, we need to obtain two values from AWS:

AWS_CLIENT_ID (the access key ID)

AWS_CLIENT_SECRET (the secret access key)

You can get both by creating your own IAM user access keys — a process nicely described in the AWS Docs.

Once we have both of these values, they need to be passed to Airflow so that it can use them to authenticate requests to the AWS API. Although one can simply enter them into a form inside Airflow’s Admin interface (Menu -> Admin -> Connections -> aws_default), it may be easier to pass these values to Airflow by exporting the AIRFLOW_CONN_AWS_DEFAULT environment variable. In our case, this may amount to executing something like the following command inside the terminal from which we will execute all the other Airflow commands later on (a sketch using Airflow’s connection URI format; a secret containing special characters such as / needs to be URL-encoded):
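# Sketch: Airflow connection URI for AWS (URL-encode the secret if needed)
export AIRFLOW_CONN_AWS_DEFAULT="aws://$AWS_CLIENT_ID:$AWS_CLIENT_SECRET@/?region_name=$AWS_REGION"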

Note that in the command above, $AWS_CLIENT_ID and $AWS_CLIENT_SECRET should be replaced with the actual values obtained from the AWS Console, and $AWS_REGION should be changed to the region where the S3 bucket and the Athena instance you are trying to query reside.

Sadly, this sometimes may not be enough, in which case it seems that the safest approach is to just export all the relevant environment variables separately:

export AWS_DEFAULT_REGION=$AWS_REGION
export AWS_ACCESS_KEY_ID=$AWS_CLIENT_ID
export AWS_SECRET_ACCESS_KEY=$AWS_CLIENT_SECRET

Just as with the other option, the variables on the right-hand side of these commands should be replaced with the actual values obtained from the AWS Console.

Once that has been taken care of, we can run the test by executing:

airflow test simple_athena_query run_query 2019-05-21

The test command will start the specified task (in our case run_query) from a given DAG (simple_athena_query in our example). The output will provide some useful details, such as which task is being executed, which connection details are being used and how the task is progressing.

Looking at the command above, we can see that we not only specified the DAG name and the ID of the task to run, but also the execution date (2019-05-21). Our run_query task definition does not make use of it, but we can very easily change that: the query parameter is templated and the execution date is exposed via the ds variable. Our updated DAG may then look as follows:
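Here is a sketch of the updated DAG, identical to the previous one (including the placeholder my_database) except for the query parameter:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator

dag = DAG(
    dag_id="simple_athena_query",
    start_date=datetime(2019, 5, 1),
    schedule_interval=None,
)

run_query = AWSAthenaOperator(
    task_id="run_query",
    # The query parameter is templated: {{ ds }} expands to the execution date.
    query=(
        "SELECT * FROM UNNEST(SEQUENCE(DATE('2019-05-01'), "
        "DATE_TRUNC('day', DATE('{{ ds }}')), INTERVAL '1' DAY))"
    ),
    database="my_database",  # placeholder: your Athena database
    output_location="s3://my-bucket/my-path/",
    dag=dag,
)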

The DAG from above, featuring a ‘templated query’.

Not much has changed, only the query now looks really scary:

SELECT * FROM UNNEST(SEQUENCE(DATE('2019-05-01'), DATE_TRUNC('day', DATE('{{ ds }}')), INTERVAL '1' DAY))

This seems to be mostly due to the fact that the whole query is a complex one-liner; split across multiple lines, it looks at least a bit more readable:

SELECT *
FROM UNNEST(SEQUENCE(DATE('2019-05-01'),
                     DATE_TRUNC('day', DATE('{{ ds }}')),
                     INTERVAL '1' DAY))

Upon closer examination, we can see that what the query does is actually quite straightforward: it lists all dates starting from May 1, 2019 up to the date for which the task is executed (that is what {{ ds }} will be replaced with). In practical terms, this means that once we execute

airflow test simple_athena_query run_query 2019-05-21

and then look at the S3 location we specified in the task’s parameters (s3://my-bucket/my-path/ in our case), we should find a .csv file with 21 lines in it: one for each day between the 1st and the 21st of May, 2019.

The ability to condition task execution on external parameters (such as the execution date, for instance) is quite powerful. Considering just our small example, we are now able to create a list of days until the end of the month without changing a single line of code in our DAG. All we need to do is run

airflow test simple_athena_query run_query 2019-05-31

and then head to the same S3 path as before. We should be able to find a .csv file with 31 lines there.

This really only scratches the surface of the capabilities of Airflow’s “templated strings”; you can read much more about their advanced features in the docs.