Filling the BigQuery paddling pool from the Kinesis Hosepipe

Posted by Gareth Jones

Introduction

This blog post details how we solved the problem of analysing large amounts of HTTP request logs for one of our clients. Spoiler: we used Amazon’s Kinesis and Lambda to stream the data into Google’s BigQuery for analysis. Read on for the juicy details!

Background

Shine has a client that serves ads to a large network of sites. The servers we manage for them get a lot of traffic every day: we have between 4 and 12 Apache httpd servers running in an autoscale group in Amazon's EC2, scaling with demand. Each server handles about 30 million requests in the course of a day, generating about 10GB of log files. So for an average of 8 servers running, that comes to 80GB per day, or 2.5TB per month. In the good old days, I would have illustrated this with a stack of CDs or floppies, but nobody knows what they are any more. It's a lot, anyway.

Analysing this data is difficult. Each server archives its logs to an S3 bucket, and the current process is for someone to download those logs, unzip them, and work some sort of grep magic.

Step 1: Set up Kinesis

Now, we could use Splunk or fluentd or logstash or some other great service for doing this, but our client is familiar with BigQuery, they like the SQL interface, and they have other datasets stored there already. As a bonus, they could run their own reports instead of having to talk to developers (and nobody wants to do that, not even developers). So our task was to find a way to get the logs into BigQuery.

As usual, Amazon have already come up with a service that helps you stream large amounts of data reliably. Kinesis streams are configured and charged based on two things: the number of shards you need, and the volume of requests (in PUT units of 25KB). The number of shards needed is determined by their capacity; one shard can handle 1MB/s of input, with a maximum of 1000 events/s. So our daily rates above, translated to per-second rates (and assuming we have an average of 8 servers running), look like this:

log size: 80GB / (24*60*60) = 0.95MB/s

events: 210m / (24*60*60) = 2430 events/s

So, although our events are small in size (one log line in Apache is only a few hundred bytes), there are a lot of them – we will need two shards to cope with this volume.

Kinesis prices in Australia are (at the time of writing):

$0.0184 per shard hour

$0.0195 per 1 million PUT units (25KB each)

So, two shards works out to a monthly cost of about $27 (2 shards x 24 hours x 31 days x $0.0184). Our 80GB/day converted into 25KB chunks is 3.4m PUT units per day, which works out to about $2 a month (3.4m x 31 days x $0.0195 per million). So the total for the Kinesis part is just $29/month. Bargain.

By default, Kinesis will keep 24 hours of streamed data, but you can have up to 7 days retention if you ask nicely (and pay nicely, as well). We won’t need that.
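If you did need longer retention, it's a one-liner in the AWS CLI (the stream name here assumes the one we create below):

```shell
# Extend retention from the default 24 hours to the 7-day maximum (168 hours).
aws kinesis increase-stream-retention-period \
  --stream-name apache-logs \
  --retention-period-hours 168 \
  --region us-west-2
```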

Creating a Kinesis stream is really easy, in the AWS command-line:

aws kinesis create-stream --stream-name apache-logs --shard-count 1 --region us-west-2

For this setup, I’m only using one shard. Notice that even though we’re in Australia, and Kinesis is available in Australia, I’m setting this up in the US-West-2 region. I’ll explain why later.

Step 2: Kinesis-Agent on the servers

Now we need to find a way of getting the Apache server logs into the Kinesis stream. Once again, the boffins at Amazon have already thought of this and they provide the really handy kinesis-agent. Not only will this magical thing read the log file and send each line as a kinesis event, it even supports standard Apache log formats and will convert them into JSON for us. There are instructions in that link for installing the service.
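On Amazon Linux the installation boils down to something like this (package and service names are per the AWS docs; on other distros you can build the agent from its GitHub repository):

```shell
# Install the agent from the Amazon Linux repos
sudo yum install -y aws-kinesis-agent

# Start it now, and on every boot
sudo service aws-kinesis-agent start
sudo chkconfig aws-kinesis-agent on
```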

Our configuration for the agent looked something like this:

{
    "kinesis.endpoint": "https://kinesis.us-west-2.amazonaws.com",
    "flows": [
        {
            "filePattern": "/var/log/apache2/access.log",
            "kinesisStream": "apache-logs",
            "dataProcessingOptions": [
                {
                    "optionName": "LOGTOJSON",
                    "logFormat": "COMBINEDAPACHELOG"
                }
            ]
        }
    ]
}

There are other configuration options, such as changing the batch size or handling other file formats, but we went with the sensible defaults. The agent itself produces a log file you can use to check if everything is set up correctly.
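To watch that log while you're setting things up (the path below is the agent's default on Amazon Linux):

```shell
# A healthy agent periodically reports how many records it has
# parsed and successfully sent to the stream.
tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log
```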

Step 3: Lambada – the forbidden dance

We’ve got a giant hosepipe of fresh log events, now we need some way of processing them. Amazon’s Lambda is ideal for this scenario. We’re going to write a node.js lambda function that will accept events from our Kinesis stream. To start with, we’ll just log the events as they come in.

exports.handler = function (event, context, cb) {
  event.Records.forEach(record => {
    var payload = new Buffer(record.kinesis.data, 'base64').toString('utf-8');
    console.log(payload);
  });
  cb(null, 'everything is just fine');
};

Lambda uses node v4.3. Kinesis event data is base64 encoded, so we have to convert it.

To allow our lambda function to access kinesis we need to create an execution role. Make a note of the ARN of the role, we’ll need that when creating the lambda function.
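Creating that role from the CLI looks something like the sketch below (the role and file names are our own choices; the managed policy ARN is AWS's standard one for Lambda-plus-Kinesis, which also grants CloudWatch logging):

```shell
# Trust policy letting the Lambda service assume the role
cat > lambda-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "lambda.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name kinesis-to-bigquery-role \
  --assume-role-policy-document file://lambda-trust.json

# Managed policy: read from Kinesis, write CloudWatch logs
aws iam attach-role-policy \
  --role-name kinesis-to-bigquery-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaKinesisExecutionRole
```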

Zip up the javascript file, and the node_modules dependencies (if any). Then upload and create the function with a command like this:

aws lambda create-function --region us-west-2 --function-name kinesis-to-bigquery --zip-file fileb://$PWD/kinesis2bq.zip --role <arn of the execution role> --handler kinesis-to-bigquery.handler --runtime nodejs4.3

Note that I’m creating the lambda function in the US-West-2 region. Lambda is not available in Australia yet, and the lambda function must be in the same region as the Kinesis stream it reads from – this is why we created the stream in US-West-2. The command above assumes that your node code is in a file called “kinesis-to-bigquery.js” and the function exported is called “handler”.

The final step is to connect the kinesis stream to the lambda function. Here is the command-line you need:

aws lambda create-event-source-mapping --region us-west-2 --function-name kinesis-to-bigquery --event-source <arn of the kinesis stream> --batch-size 1000 --starting-position TRIM_HORIZON

To get the ARN of your Kinesis stream, you can run the following command:

aws kinesis describe-stream --stream-name apache-logs --region us-west-2
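If you only want the ARN itself, a `--query` filter on the same command saves some squinting at JSON:

```shell
# Pull just the stream ARN out of the describe-stream output
aws kinesis describe-stream \
  --stream-name apache-logs \
  --region us-west-2 \
  --query 'StreamDescription.StreamARN' \
  --output text
```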

If everything has been set up correctly, our lambda function should now be writing the decoded kinesis events out to its CloudWatch logs.

Step 4: Big Google’s BigQuery For Big Data

For the last trick, we need to get these events into BigQuery so that we can query them big-ly. To do this, we’ll create a table to hold the events, with columns that match the JSON format of the events.

Create a service account for our lambda function to use. Download the credentials as a JSON file – you’ll need those in a minute. Grant the service account access to the dataset that contains the access logs table.
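Creating the table with the `bq` CLI might look like the sketch below. The schema is my guess at the field names the agent's LOGTOJSON/COMBINEDAPACHELOG conversion produces; check the JSON keys in your own CloudWatch output before creating the table.

```shell
# Dataset and table names match the ones used in the lambda code below;
# the column list is an assumption based on the combined Apache log format.
bq mk --table ApacheWebserver.access \
  host:STRING,ident:STRING,authuser:STRING,datetime:STRING,request:STRING,response:STRING,bytes:STRING,referrer:STRING,agent:STRING
```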

We’re going to use the official Google node.js API client library. There is another library that supports BigQuery, amongst other APIs, but it looked a little more difficult to modify in the way I need for step 5 later. Here’s the code for bulk-inserting our events:

const uuid = require('node-uuid');
const google = require('googleapis');
const bigquery = google.bigquery('v2');

var key = require('./credentials.json');
var jwtClient = new google.auth.JWT(
  key.client_email,
  null,
  key.private_key,
  ["https://www.googleapis.com/auth/bigquery",
   "https://www.googleapis.com/auth/bigquery.insertdata"],
  null);

exports.handler = function (event, context, cb) {
  var rows = event.Records.map(record => {
    return new Buffer(record.kinesis.data, 'base64').toString('utf-8');
  }).map(JSON.parse.bind(JSON)).map(i => {
    return { insertId: uuid.v4(), json: i };
  });

  jwtClient.authorize(function (err, tokens) {
    if (err) { cb(err); return; }
    bigquery.tabledata.insertAll({
      auth: jwtClient,
      projectId: 'our project id',
      datasetId: 'ApacheWebserver',
      tableId: 'access',
      resource: { rows: rows }
    }, cb);
  });
};

The credentials.json file is the one I told you to download and remember when you created the service account. Each event is converted from Base64, parsed as JSON, then given a UUID. The BigQuery bulk insert API uses the insertId to reconcile repeat inserts (handy when your network connection temporarily fails and you retry the request).

Update the lambda code (remembering to add the credentials.json file and the node_modules directory to the zip) and we should be good to go – the parsed log events now land as rows in our BigQuery table.

Good work, team! High fives and fist bumps for everyone. But hang on – sending data out of Amazon and into Google incurs an egress charge from Amazon. I feel a terrible disturbance in my wallet. We were generating 2.5TB/month, remember? At $0.14/GB, that’s about $350! Yikes. What can we do about this?

Step 5: GZIP All The Things!

The BigQuery API has support for gzipped responses, but what’s not obvious from the documentation is that it also has support for gzipped requests. If we can gzip our streaming inserts into BQ, that will cut down the data egress costs by about 70% (that figure is based on some really complicated analysis I did and not at all made up – honest). To do this we have to make a couple of little changes to our lambda function – and unfortunately a change to the googleapis library, because the library doesn’t support it.

Here’s our new lambda function:

const uuid = require('node-uuid');
const google = require('googleapis');
const bigquery = google.bigquery('v2');
const streams = require('stream');
const zlib = require('zlib');

var key = require('./credentials.json');
var jwtClient = new google.auth.JWT( /* this bit is the same as above */ );

exports.handler = function (event, context, cb) {
  var rows = event.Records.map( /* this bit is the same as above */ );

  var stream = new streams.Readable();
  stream._read = function () {}; // do nothing, avoids errors
  stream.push(JSON.stringify({ rows: rows }));
  stream.push(null); // end of stream

  var gzippedStream = zlib.createGzip();
  stream.pipe(gzippedStream);

  jwtClient.authorize(function (err, tokens) {
    if (err) { cb(err); return; }
    bigquery.tabledata.insertAll({
      auth: jwtClient,
      projectId: 'our project id',
      datasetId: 'ApacheWebserver',
      tableId: 'access',
      headers: {
        'Content-Encoding': 'gzip',
        'Content-Type': 'application/json',
        'User-Agent': '(gzip)'
      },
      resource: gzippedStream
    }, cb);
  });
};

I’m now passing in a gzipped stream as the resource for the insertAll request, and setting a couple of HTTP headers to tell BigQuery to expect a gzipped request. To get this to work, I had to modify the lib/apirequest.js file in the googleapis library. Around line 161, I’ve added this:

if (isReadableStream(resource)) {
  options.json = true;
  body = resource;
} else {
  options.json = resource ||
    ((options.method === 'GET' || options.method === 'DELETE') ? true : {});
}

This allows our gzipped stream to be piped into the request body. With that change in place, our inserts reach BigQuery as gzipped requests. I’ll submit a pull request to the googleapis library to make this more straightforward.

Step 6: Profit!

In this case, it’s Amazon and Google that profit, but hey. Even more good work, team! This may require the highest of fives, perhaps involving a step-ladder. We’ve just taken logs from Apache servers on EC2, streamed them via Kinesis and a node.js Lambda function into Google’s BigQuery. It’s possible we could have used more Cloud technologies, but it’s difficult to see where. More importantly, our lovely client can now query their server logs in the interface they like without having to talk to any of us.