Every so often, I think about the poor guy at Google who adds hard drives to support Google Analytics. Seriously, a whopping 58M websites send data to it every single day!

Excited about the 4th upgrade of the Google Analytics code (gtag.js), I went ahead and integrated it into a new project. Mind you, migrating from the previous analytics.js sucks, and the documentation is still lacking.

I soon realized this version ain’t gonna cut it. For example:

- You need to jump through hoops to get feature toggles/experiments to stick with every event (which I ended up doing using custom dimensions)
- Session Duration (one of our most important KPIs) cannot be sliced by custom dimensions
- Slicing is only allowed by one custom dimension at a time
- You can only get metric averages, not medians or percentiles
- Passing media costs in and optimizing ROI across sources is like investigating a murder mystery
- Passing offline/3rd-party data and conversion pixels isn’t supported
- Only some data/views are real-time

For some of these shortcomings, I thought of hacking my way around with Enhanced E-commerce tracking, which can receive fake “sales” to carry additional data and slicing options, but it didn’t feel right.

With my love for GCP, and based on this wonderful tutorial, I hereby give you the 3-hour walkthrough to a better Google Analytics for your business.

Note: this can be easily ported to AWS (see the end of the post).

Create a bucket to host your pixels

- Call it whatever you’d like
- Grab a transparent favicon.ico from http://www.favicon.cc/?action=icon&file_id=393493 to avoid some 404s
- Grab a transparent PNG from http://www.1x1px.me/ (the one Google suggests is 10x heavier)
- Upload your transparent pixel as visit (no extension), start, end, media and whatever other events you wish to track (a scripted version is sketched below)
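If you’d rather script this step than click through the console, a sketch with the official @google-cloud/storage Node client could look like this (the bucket and file names here are made up, adjust to taste):

```javascript
// Hypothetical scripted setup of the pixel bucket.
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

async function setUpPixelBucket() {
  const [bucket] = await storage.createBucket('my-pixel-bucket');
  // Upload the 1x1 transparent PNG once per event type, extension-less.
  for (const name of ['visit', 'start', 'end', 'media']) {
    await bucket.upload('1x1.png', { destination: name });
    await bucket.file(name).makePublic(); // pixels must be world-readable
  }
  await bucket.upload('favicon.ico'); // keeps those 404s out of your logs
  await bucket.file('favicon.ico').makePublic();
}

setUpPixelBucket().catch(console.error);
```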

Create a Load-Balancer to log calls to your pixel

- Backend configuration should point to your bucket
- No host or path rules
- I use an ephemeral IP and HTTP only, but you can have this listen on HTTPS and hook up a domain with your own SSL certificate for cheap too
- I did not enable CDN as the tutorial suggests, nor did I find a reason to

Create a simple JavaScript logger

- Without going to great lengths, you want to generate URLs that start with your load balancer address, followed by the event type and a concatenation of all dimensions and metrics
- The first and only event that includes information about your experiments and user profile should go to /visit?userId=, sending some screen info, inbound utm params and a list of experiments (see the sketch after this list)
- The remaining events can carry just /event?userId= and any additional metrics, say 2 metrics every 10 seconds
- If you can, use navigator.sendBeacon for better delivery (with a fallback, as sketched below), especially when the user navigates away from the page or even closes a tab
- Note navigator.sendBeacon uses POST, which apparently fails for GCS objects, but since it doesn’t expect any response it will work just fine
- Generate some sample events so you can see them accumulating as logs
- Obviously, this can be ported to any environment that can send HTTP
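Here is a minimal sketch of such a logger. The host, helper functions, parameter and metric names below are all illustrative:

```javascript
// Minimal pixel logger sketch; names and values are illustrative only.
var PIXEL_HOST = 'http://my.loadbalancer.address';

// Naive persistent id via localStorage; you may prefer a cookie.
function getUserId() {
  var id = localStorage.getItem('userId');
  if (!id) {
    id = Math.random().toString(36).slice(2);
    localStorage.setItem('userId', id);
  }
  return id;
}

function send(event, params) {
  var qs = Object.keys(params).map(function (k) {
    return encodeURIComponent(k) + '=' + encodeURIComponent(params[k]);
  }).join('&');
  var url = PIXEL_HOST + '/' + event + '?' + qs;
  if (navigator.sendBeacon) {
    // The POST fails against the GCS object, but the load balancer
    // logs the request anyway, and we never read the response.
    navigator.sendBeacon(url);
  } else {
    new Image().src = url; // plain GET fallback
  }
}

// The single /visit hit: user id, screen info, utm params, experiments.
var utm = location.search.match(/[?&]utm_source=([^&]*)/);
send('visit', {
  userId: getUserId(),
  screen: screen.width + 'x' + screen.height,
  utmSource: utm ? utm[1] : '',
  experiments: 'newCheckout,blueButton' // e.g. your active feature toggles
});

// Subsequent hits carry just the id plus metrics, here 2 metrics every 10s.
setInterval(function () {
  send('event', { userId: getUserId(), secondsPlayed: 10, bufferEvents: 0 });
}, 10000);
```

sendBeacon queues the hit for delivery even while the page unloads, which is why it beats the Image GET for the tab-close case.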

Set up Logging Export to BigQuery

- Create a filter that lists all logs from your newly created Load Balancer
- Optionally, narrow it down to only the URL hits you want, excluding erroneous calls, hits to your favicon.ico, etc. (see the example filter below)
- Whatever you end up with here will be imported to BigQuery in batches
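The exact filter depends on your resource names, but as a sketch, a Cloud Logging advanced filter along these lines isolates the pixel hits:

```
resource.type="http_load_balancer"
httpRequest.requestUrl:"my.loadbalancer.address"
NOT httpRequest.requestUrl:"favicon.ico"
```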

Optionally fire your 3rd party data

- Since you’ve essentially created a pixel-server endpoint, nothing prevents you (or others) from sending additional data
- This data can be user-specific, e.g. a conversion pixel would look like this: http://my.loadbalancer.address/conversion?userId=previousUserId&total=100&commission=10&currency=EUR
- Or aggregated, e.g. a media-cost breakdown from one of your sources: http://my.loadbalancer.address/media?utmSource=adtecho&country=DE&clicks=1000&cost=50&currency=GBP
- I’m guessing you’ll want to automate report fetching at some point, e.g. via cron jobs and curl (a minimal Node sketch follows)
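As a hypothetical example, a server-side hit reusing the conversion URL above takes only a few lines of Node:

```javascript
// Fire a conversion pixel server-side with Node's built-in http module.
const http = require('http');

const url = 'http://my.loadbalancer.address/conversion' +
  '?userId=previousUserId&total=100&commission=10&currency=EUR';

// A simple GET is enough: the load balancer logs the request either way,
// and we don't care about the response body.
http.get(url, (res) => res.resume()).on('error', console.error);
```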

Query BigQuery

- With some SQL-lovin’, you can now query user activity: averages, medians, standard deviations and percentiles
- The sketch below shows how to join a user’s visit to subsequent “play” events (essentially querying the same table twice)
- BigQuery will automatically partition your events by date
- Remember BigQuery charges by the GB stored and the TB queried, so if the number of events explodes, consider expiring old data
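Here’s a sketch of such a query, under a couple of assumptions: the export sink writes daily tables named requests_YYYYMMDD into my_dataset, and the full URL lands in httpRequest.requestUrl (adjust both to whatever your export actually produces):

```sql
#standardSQL
-- Join each user's /visit hit to their subsequent /play hits and derive
-- a rough session duration. Table and field names are assumptions.
WITH hits AS (
  SELECT
    timestamp,
    REGEXP_EXTRACT(httpRequest.requestUrl, r'userId=([^&]+)') AS userId,
    REGEXP_EXTRACT(httpRequest.requestUrl, r'//[^/]+/([^?]+)') AS event
  FROM `my_project.my_dataset.requests_*`
)
SELECT
  v.userId,
  COUNT(*) AS plays,
  TIMESTAMP_DIFF(MAX(e.timestamp), v.timestamp, SECOND) AS session_seconds
FROM hits AS v
JOIN hits AS e
  ON e.userId = v.userId
 AND e.timestamp > v.timestamp
WHERE v.event = 'visit'
  AND e.event = 'play'
GROUP BY v.userId, v.timestamp
```

Medians and percentiles then come for free, e.g. APPROX_QUANTILES(session_seconds, 100)[OFFSET(50)] over the result above.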

Using AWS?

- The concept is very much the same; just replace BigQuery with Redshift (yuck) or Athena (which also supports RegEx; thanks Dylan Sather for the tip!)
- S3 buckets can be set up to write access logs and support HTTPS, so there’s no need to set up a Load Balancer

And there you have it: fully serverless SQL querying over your events, blazing fast and almost real-time. Compared to Google Analytics 360’s $150K/year and Mixpanel’s $999 per 10M events, I’d say this is pretty decent.

Sure, there’s no fancy-shmancy dashboard (though with some work you can hook up Data Studio) and no client drivers (it’s all HTTP, though you do have to handle persistence and cookies yourself), but what could be better than total control over your own data?

Update: Create an Interactive Dashboard