In one of our projects I was asked to capture Google Analytics data in real time and feed it into a Kafka topic for further processing and analysis. As part of a proof of concept I’ve set up a Kafka Broker and the Confluent REST Proxy shipped with the Confluent Platform. Additionally, a Google Analytics plugin (actually custom JavaScript code) sends a POST request with all the analytics data to the REST Proxy.

Photo by DDP on Unsplash

1) Download, set up and start the Kafka environment

The binary can be downloaded from Confluent. In my case it was version 5.2.1.

To expose the REST Proxy to the world, I’m using localtunnel.me.

Since it occasionally fails with:

Error: connection refused: localtunnel.me:xxxxx (check your firewall settings)

I have a script for that:
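A minimal sketch of such a restart wrapper in Node.js, assuming the localtunnel CLI is installed as `lt` and the REST Proxy listens on its default port 8082. The helper name `keepTunnelOpen` and its options are made up for illustration:

```javascript
var childProcess = require('child_process');

// Hypothetical helper: starts a command and starts it again whenever it exits.
function keepTunnelOpen(options) {
  var spawnFn = options.spawnFn || childProcess.spawn; // injectable for testing
  var schedule = options.schedule || setTimeout;       // injectable for testing
  var delayMs = options.delayMs === undefined ? 1000 : options.delayMs;
  var maxRestarts = options.maxRestarts === undefined ? Infinity : options.maxRestarts;
  var state = { starts: 0 };

  function start() {
    state.starts += 1;
    var child = spawnFn(options.command, options.args, { stdio: 'inherit' });
    child.on('error', function (err) {
      // Emitted when the command itself cannot be started (e.g. not installed).
      console.error('Could not start ' + options.command + ': ' + err.message);
    });
    child.on('exit', function () {
      if (state.starts > maxRestarts) { return; } // give up after maxRestarts restarts
      schedule(start, delayMs);
    });
  }

  start();
  return state;
}

// Usage (assumption: `lt` is on the PATH, REST Proxy on port 8082):
// keepTunnelOpen({ command: 'lt', args: ['--port', '8082'], delayMs: 2000 });
```

Without `maxRestarts` the wrapper restarts the tunnel indefinitely, which is what you want for a long-running proof of concept.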

2) Install Google Tag Manager on your site

Google Tag Manager (GTM) lets you manage multiple JavaScript tags from one place. To install it on my site, I’ve added the following scripts:

The next step is to add a Google Analytics tag and configure it. Go to the Tags section in the left menu and create a new Google Analytics — Universal Analytics tag that is triggered on all pages:

Finally, create a JavaScript variable responsible for sending the GA data to the REST Proxy. Choose the Variables button from the left menu and create a new Custom JavaScript variable:

Here’s the custom JavaScript code:

This function does a POST request with the GA payload to the REST Proxy. The Content-Type Header is required, otherwise you’ll get an HTTP 415 Unsupported Media Type error in response.
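A minimal sketch of what such a function can look like, assuming the REST Proxy v2 API with the JSON embedded format. The names `toKafkaBody` and `createGaReplicator`, the injectable `XhrCtor` parameter and the endpoint URL are placeholders for illustration, not the exact code used on my site:

```javascript
// Wraps a GA hit payload in the JSON envelope expected by the REST Proxy v2 API.
function toKafkaBody(payload) {
  return JSON.stringify({ records: [{ value: payload }] });
}

// Builds a customTask function. `endpoint` is the topic URL on the REST Proxy
// (placeholder); `XhrCtor` defaults to the browser's XMLHttpRequest.
function createGaReplicator(endpoint, XhrCtor) {
  return function (model) {
    var originalSendHitTask = model.get('sendHitTask');
    model.set('sendHitTask', function (sendModel) {
      var payload = sendModel.get('hitPayload');
      originalSendHitTask(sendModel); // still send the hit to Google Analytics

      var Xhr = XhrCtor || XMLHttpRequest;
      var request = new Xhr();
      request.open('POST', endpoint, true);
      // Without this header the REST Proxy answers with HTTP 415.
      request.setRequestHeader('Content-Type', 'application/vnd.kafka.json.v2+json');
      request.send(toKafkaBody(payload));
    });
  };
}
```

In GTM, a Custom JavaScript variable has to be a single anonymous function that returns a value, so the helpers above would live inside `function() { ... return createGaReplicator('https://<your-tunnel>/topics/<your-topic>'); }`.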

The last step is to add this variable to your tag. Go to the Tags section to edit your GA tag. Check Enable overriding settings in this tag, expand More Settings and under Fields to Set add a Field Name called customTask with the Value {{GA Replicator}}:

You’re ready to Submit the changes and Publish them:

At this point, you should have a properly configured site. To verify that, inspect the page with Chrome Developer Tools and open the Network tab. You should see a POST request being sent to the REST Proxy. But what you actually have is a single OPTIONS request:

This happens because the browser is making a cross-origin request: the JavaScript code, served from one origin (the web server hosting the page), calls a server at a different origin. Under Cross-Origin Resource Sharing (CORS), the browser first asks the target server for permission with an OPTIONS preflight request and only sends the actual POST if the server allows it. The REST Proxy is not configured for this scenario yet.
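Why the browser sends an OPTIONS request first can be illustrated with a simplified version of the CORS rules. This sketch only looks at the HTTP method and the Content-Type header (other custom headers can trigger a preflight too), and the function name `needsPreflight` is made up for illustration:

```javascript
// Methods and content types that do NOT trigger a preflight on their own.
var SIMPLE_METHODS = ['GET', 'HEAD', 'POST'];
var SIMPLE_CONTENT_TYPES = [
  'application/x-www-form-urlencoded',
  'multipart/form-data',
  'text/plain'
];

// Simplified check: does a cross-origin request need a CORS preflight?
function needsPreflight(method, contentType) {
  if (SIMPLE_METHODS.indexOf(method.toUpperCase()) === -1) { return true; }
  if (!contentType) { return false; }
  // Ignore parameters such as ";charset=UTF-8".
  var mimeType = contentType.split(';')[0].trim().toLowerCase();
  return SIMPLE_CONTENT_TYPES.indexOf(mimeType) === -1;
}

console.log(needsPreflight('POST', 'application/vnd.kafka.json.v2+json')); // true
console.log(needsPreflight('POST', 'text/plain;charset=UTF-8'));           // false
```

The Content-Type required by the REST Proxy is not on the safelist, so the POST is always preceded by a preflight.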

3) Enable CORS on the Kafka REST Proxy server

Stop the REST Proxy and edit etc/kafka-rest/kafka-rest.properties. Add the following lines and restart it:

The * value is not a clever choice; you may want to restrict which domains have access to the REST Proxy server. In my case, that would be https://kijanowski.eu.
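A restricted variant could look like the fragment below. It is a sketch that assumes the `access.control.allow.origin` and `access.control.allow.methods` settings of the REST Proxy; the method list is an assumption and should be trimmed to what your setup needs:

```properties
# Allow cross-origin requests only from the site that embeds the GTM container
access.control.allow.origin=https://kijanowski.eu
# Methods permitted for cross-origin requests (assumed list, adjust as needed)
access.control.allow.methods=GET,POST,PUT,DELETE,OPTIONS,HEAD
```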

Four new WARN messages will show up; they can be safely ignored:

By the way, you can also get rid of the first ERROR message:

Either apply #538 or create this file and set its access rights properly.

Now, if you refresh the page in your browser, you will see an OPTIONS request followed by a POST request with the GA payload:

The payload can be reviewed with the kafka-console-consumer:
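The GA hit payload is a URL-encoded query string, so it is hard to read in the console. A minimal sketch for decoding one consumed value into an object, using the standard `URLSearchParams` API; the helper name `parseHitPayload` and the sample payload are made up for illustration:

```javascript
// Decodes a GA hit payload (URL-encoded query string) into a plain object.
function parseHitPayload(payload) {
  var fields = {};
  new URLSearchParams(payload).forEach(function (value, key) {
    fields[key] = value;
  });
  return fields;
}

// Sample payload with the protocol version, hit type and document location.
console.log(parseHitPayload('v=1&t=pageview&dl=https%3A%2F%2Fkijanowski.eu%2F'));
```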

Summary

As with a first rendezvous, it could have gone better ;)

Although this approach left us with raw Google Analytics data in a Kafka topic, it’s missing the HTTP headers, the user-agent information and the path that was opened, just to name a few. The latter two can be captured with an adjusted request.send command:
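A minimal sketch of such an adjusted body, assuming the same `{ records: [...] }` envelope as before. The helper name `toEnrichedKafkaBody` and the field names `payload`, `userAgent` and `path` are chosen for illustration:

```javascript
// Builds a REST Proxy request body that carries the GA payload together with
// the browser's user agent and the path of the current page.
function toEnrichedKafkaBody(payload, nav, loc) {
  return JSON.stringify({
    records: [{
      value: {
        payload: payload,
        userAgent: nav.userAgent,
        path: loc.pathname
      }
    }]
  });
}

// In the browser:
// request.send(toEnrichedKafkaBody(payload, navigator, window.location));
```

Both values come from browser objects available to the script, which is why they are easy to add.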

I had no luck injecting the HTTP headers into the payload, though.

To capture all this additional data, I’ve found a dedicated toolset from Snowplow Analytics quite useful. I’ll describe this approach in the next post.