I’ve already written about a similar setup built solely on Confluent tools, with the Kafka broker and the REST Proxy acting as the gateway for Google Analytics, and I’ve also mentioned a few issues that setup faced. In this post we’re going to use Snowplow as the intermediate layer feeding Kafka. Additionally, we’ll bring the Schema Registry, Kafka Streams and KSQL into the game to build a schema-aware streaming platform that allows business-oriented users to get insight into users’ behavior, again, all in real time.

Boilerplate

As mentioned, we’ll work with the whole Confluent stack: the broker, the Schema Registry, KSQL and also a Kafka Streams application. The binaries can be downloaded from Confluent; in my case it was the self-managed Confluent Platform, version 5.3.0.
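Once the archive is unpacked and its bin directory is on the PATH, the whole local stack (ZooKeeper, Kafka, the Schema Registry, the KSQL server and friends) can be brought up with the development CLI that ships with 5.3; roughly:

    # Confluent Platform 5.3 ships a development CLI for single-node setups
    confluent local start

    # verify that all services are up
    confluent local status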

Google Analytics collector

We need a gateway that receives data from Google Analytics and passes it on to Kafka. It turns out that Snowplow’s Scala Stream Collector is a perfect fit. Snowplow is not just about collecting and storing data; it’s rather a set of open-source tools (trackers, collectors, enrichers) suitable for building a full-blown product analytics platform. Since our goal is to capture and massage the data before analyzing it with KSQL, the collector alone is sufficient.

Go to Snowplow’s Bintray artifacts repository and look for snowplow_scala_stream_collector_kafka. The last stable version as of the time of writing is 0.15.0, but there’s also 1.0.0_rc2. Looking at the CHANGELOG, I have the impression that there’s no real difference between the two and the version bump is just related to the release process of the whole Snowplow toolbox. Let’s go with 0.15.0; it will work just fine. Once the zip is extracted, the collector has to be configured. Create a configuration file called ga2kafka.conf:
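The reference configuration bundled with the collector is fairly long; a trimmed-down sketch of such a ga2kafka.conf, assuming a single local Kafka broker on localhost:9092, looks roughly like this:

    collector {
      interface = "0.0.0.0"
      port = 35353                    #TBD: the port the collector listens on

      cookie {
        enabled = true
        expiration = "365 days"
        name = sp
        domain = "example.com"        #TBD: your site's domain
      }

      streams {
        good = ga-success             #TBD: topic for successfully collected events
        bad = ga-bad                  #TBD: topic for malformed events

        sink {
          enabled = kafka             #TBD: use the Kafka sink
          brokers = "localhost:9092"  #TBD: Kafka bootstrap servers
          retries = 0
        }

        buffer {
          byteLimit = 4096
          recordLimit = 500
          timeLimit = 5000
        }
      }
    }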

I’m not going into detail about what all these switches do, but I’ve marked each modified one with #TBD. Most of them are self-explanatory anyway, and the important ones are collector.port, cookie.domain, streams.good, streams.bad, sink.enabled and sink.brokers.

Once started, the collector will listen on port 35353 for Google Analytics data and store it in the ga-success topic, serialized in the 😱 Thrift 😱 format. But don’t worry: to work with Thrift you won’t have to install the Thrift compiler, which would be worth another blog post on its own. I’ll come back to that later.

For developers working on a local machine

Since my computer is connected to the internet behind a router, I use localtunnel to expose my machine to the outside world:
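Assuming the collector listens on port 35353 and we want a stable subdomain (it has to match the URL configured later in Google Tag Manager), the tunnel boils down to:

    # install once
    npm install -g localtunnel

    # expose the local collector as https://ga-replicator.localtunnel.me
    lt --port 35353 --subdomain ga-replicator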

It may happen that the tunnel closes for no good reason, so it helps to restart it automatically.
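A minimal sketch of such a restart script, reusing the port and subdomain from above:

    #!/usr/bin/env bash
    # keep the tunnel alive: restart it whenever it dies
    while true; do
      lt --port 35353 --subdomain ga-replicator
      echo "tunnel closed, restarting..." >&2
      sleep 1
    done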

Website setup

Install Google Tag Manager on your site by adding a JavaScript snippet, as described:
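The snippet is generated for you in the GTM admin console and looks roughly like this, with GTM-XXXXXXX standing in for your container ID (a small noscript iframe counterpart goes right after the opening body tag):

    <!-- Google Tag Manager -->
    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
    new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
    j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
    'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','GTM-XXXXXXX');</script>
    <!-- End Google Tag Manager -->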

Next add a Google Analytics tag and configure it. Go to the Tags section in the left menu and create a new Google Analytics — Universal Analytics tag that is triggered on all pages:

Finally, create a JavaScript variable responsible for sending GA data to Snowplow. Choose the Variables button from the left menu and create a new Custom JavaScript variable:

Here’s the custom JavaScript code:
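It builds on Universal Analytics’ customTask hook: wrap sendHitTask so that every hit is first sent to Google as usual and then replicated, as a raw payload, to our collector endpoint. A sketch of the idea:

    function() {
      // This Custom JavaScript variable returns a customTask function;
      // Universal Analytics calls it with the tracker model on every hit.
      return function(model) {
        var endpoint = 'https://ga-replicator.localtunnel.me/com.google.analytics/v1';
        // keep the original sendHitTask so the hit still reaches Google Analytics
        var originalSendHitTask = model.get('sendHitTask');
        model.set('sendHitTask', function(sendModel) {
          originalSendHitTask(sendModel);
          // replicate the raw GA hit payload to the Snowplow collector
          var xhr = new XMLHttpRequest();
          xhr.open('POST', endpoint, true);
          xhr.setRequestHeader('Content-Type', 'text/plain; charset=UTF-8');
          xhr.send(sendModel.get('hitPayload'));
        });
      };
    }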

The key thing to note here is the POST request to https://ga-replicator.localtunnel.me/com.google.analytics/v1. This is where our Snowplow Google Analytics collector is listening for data.

The last step is to add this variable to your tag. Go to the Tags section to edit your GA tag. Check Enable overriding settings in this tag, expand More Settings and, under Fields to Set, add a Field Name called customTask with a Value of {{GA Replicator}}:

Don’t forget to Submit and Publish your changes.

At this point, you should have a properly configured site. To verify that, inspect the page with Chrome Developer Tools and open the Network tab. You should see a POST request being sent to the collector.

Ready, steady, go!

Let’s examine the ga-success topic and try to understand the content:
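The console consumer is enough for a first peek; the values are binary Thrift, so expect mostly gibberish with some readable strings (path, user agent, the GA payload) in between:

    kafka-console-consumer --bootstrap-server localhost:9092 \
      --topic ga-success --from-beginning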

This is the part where we need a translator between Snowplow’s Thrift format and Avro serialization so that we can use the messages later on in KSQL.

For this purpose I’ve created a Kafka Streams application called Thrift2Avro with a custom Thrift deserializer. The application could also perform other business logic, like filtering out messages whose IP address belongs to one of our QA clients.

Let’s look at the topology:
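Here’s a sketch of its core in the Kafka Streams Scala DSL. The Avro field names ip and body, the example QA IP address and the CollectorPayloadThriftSerdeProvider.serde factory method are illustrative placeholders, and the StreamsConfig plumbing and the KafkaStreams start-up are left out:

    import scala.collection.JavaConverters._

    import com.snowplowanalytics.snowplow.CollectorPayload.thrift.model1.CollectorPayload
    import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.kafka.common.serialization.Serde
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._
    import org.apache.kafka.streams.scala.StreamsBuilder

    // the output schema: just two fields for the demo
    val schema: Schema = new Schema.Parser().parse(
      """{
        |  "type": "record", "name": "GaHit", "namespace": "ga2kafka",
        |  "fields": [
        |    {"name": "ip",   "type": ["null", "string"], "default": null},
        |    {"name": "body", "type": ["null", "string"], "default": null}
        |  ]
        |}""".stripMargin)

    // custom Thrift serde for the collector payload (factory name is a placeholder)
    implicit val collectorPayloadSerde: Serde[CollectorPayload] =
      CollectorPayloadThriftSerdeProvider.serde

    // Avro serde backed by the Schema Registry
    implicit val genericAvroSerde: Serde[GenericRecord] = {
      val serde = new GenericAvroSerde()
      serde.configure(Map("schema.registry.url" -> "http://localhost:8081").asJava, false)
      serde
    }

    val builder = new StreamsBuilder()

    builder
      .stream[String, CollectorPayload]("ga-success")   // deserialize Thrift payloads
      .mapValues { payload =>                           // Thrift -> generic Avro record
        val record: GenericRecord = new GenericData.Record(schema)
        record.put("ip", payload.getIpAddress)
        record.put("body", payload.getBody)
        record
      }
      .filterNot((_, record) => record.get("ip") == "10.1.2.3")  // drop hits from a QA client (example IP)
      .to("ga-parsed")                                  // materialize for KSQL

    val topology = builder.build()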

First the application creates a stream from the ga-success topic. Messages are consumed (that is, deserialized) with CollectorPayloadThriftSerdeProvider. The schema of the Snowplow message is available on GitHub; there are fields like ipAddress and userAgent, just to name a few. The important thing is that we do not have to compile this IDL file with a Thrift compiler. Fortunately the Snowplow team provides an artifact, com.snowplowanalytics:collector-payload-1:0.0.0, which we can include as a dependency in our project and use the compiled Java class com.snowplowanalytics.snowplow.CollectorPayload.thrift.model1.CollectorPayload.

Looking further down the pipeline, we map each message into a generic Avro record, extracting a subset of the fields. For demo purposes the schema is quite limited, containing only two fields. The most important part, the body field, would need to be split further by the & character.

The next step is the filterNot method, which keeps only those messages that do not satisfy the given condition, in this case having the ip field set to a given value.

Finally the stream is materialized to the ga-parsed topic for further analytics.

Let’s see it in action:
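With the Streams application running and the site getting some traffic, the Avro console consumer (which talks to the Schema Registry) shows the transformed records landing in ga-parsed:

    kafka-avro-console-consumer --bootstrap-server localhost:9092 \
      --topic ga-parsed --from-beginning \
      --property schema.registry.url=http://localhost:8081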

All you need is KSQL

With such a setup we’re ready to run SQL statements against our schema-aware ga-parsed topic. Since the KSQL server is already running, let’s start the KSQL client, create a stream over the ga-parsed output topic and select all records. Then let’s select just those that start with 89:
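A short session could look like this; the column names ip and body come from the Avro schema sketched above, and the filter here narrows down to records whose ip starts with 89:

    -- inside the CLI, started with: ksql http://localhost:8088
    SET 'auto.offset.reset' = 'earliest';

    -- KSQL picks the columns up from the Schema Registry
    CREATE STREAM ga_parsed WITH (KAFKA_TOPIC='ga-parsed', VALUE_FORMAT='AVRO');

    SELECT * FROM ga_parsed;

    SELECT ip, body FROM ga_parsed WHERE ip LIKE '89%';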

Summary

From a technical point of view, this exercise was all about getting data from Google Analytics into Kafka and transforming the messages according to a given Avro schema, so that they can be queried with KSQL.

From a business point of view, though, this is just the beginning. In our project we combined (joined) messages from multiple sources, like the transactions and warehouse databases, and finally either materialized them in further topics to run interactive queries or exported the raw (Thrift-formatted) messages to Google Cloud Platform’s Pub/Sub. There, another tool from the Snowplow family, the BigQuery Loader, was used to move the data into BigQuery, Google’s data warehouse, to process it further.