Saturday morning hacks: Building an Analytics App with Flask

A couple years back I wrote about building an Analytics service with Cassandra. As fun as that project was to build, the reality was that Cassandra was completely unsuitable for my actual needs, so I decided to switch to something simpler. I'm happy to say the replacement app has been running without a hitch for the past 5 months taking up only about 20 MB of RAM! In this post I'll show how to build a lightweight Analytics service using Flask.

Analytics request/response cycle

The analytics service we'll be building will follow a blueprint popularized by Google Analytics. Here's how it works:

Each page we wish to track will include a <script> tag referencing a JavaScript file served by our analytics app (placed in your base template, for example).

Someone visits your site and their browser executes the JavaScript file.

The JavaScript contains code to read the current page's title and URL, as well as other interesting metadata.

Now the cool part: the script will dynamically create a new <img> element, specifying as its src a URL served by our analytics app.

The page metadata we collected is encoded in the querystring of the new image's src attribute, which is in turn parsed by our analytics server.

The analytics server adds a new row to the database and returns a 1-pixel GIF.
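The encode-then-parse round trip described in these steps can be sketched in Python using only the standard library. This is a rough illustration rather than the app's actual code: `build_beacon_url` and `parse_beacon_url` are hypothetical helper names, the sample page values are made up, and the parameter names `url`, `t` and `ref` are the ones used by the tracking script in this post.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_beacon_url(domain, page_url, title, referrer):
    """Encode page metadata into the querystring of the beacon image URL."""
    query = urlencode({'url': page_url, 't': title, 'ref': referrer})
    return '%s/a.gif?%s' % (domain, query)


def parse_beacon_url(beacon_url):
    """Recover the page metadata, as the analytics server would."""
    query = urlsplit(beacon_url).query
    # keep_blank_values preserves an empty referrer instead of dropping it.
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


beacon = build_beacon_url(
    'http://127.0.0.1:5000',
    'http://example.com/blog/?page=2',
    'My Blog & Notes',
    '')
print(beacon)
print(parse_beacon_url(beacon))
```

Characters such as `&` and `?` inside the page URL or title survive the trip because they are percent-encoded before being placed in the querystring.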

Here is a diagram of the requests and responses:

Design considerations

Since this is running on a VPS with limited resources, and because my blog doesn't really receive that much traffic, we'll go with something lightweight and functional. I like the Flask framework for projects of all sizes, and it should work really well for this particular app. We'll also use the peewee ORM for storing the page-views and, later in this post, running queries against our analytics data. All told, our app will be less than 100 lines of code including comments!

Relational Database

In order to be able to easily run lots of ad-hoc queries, we'll use a relational database to store the page-view data. I chose to use SQLite because it is a lightweight embedded database, and won't take up too much RAM. If you're already running PostgreSQL or MySQL, then feel free to use one of them instead.
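To give a feel for what "ad-hoc queries" means here, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The `pageview` table, its two columns and the sample rows are made up for illustration; the real schema is defined with peewee later in this post.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pageview (url TEXT, ip TEXT)')
conn.executemany(
    'INSERT INTO pageview (url, ip) VALUES (?, ?)',
    [('/', '10.0.0.1'), ('/blog/', '10.0.0.1'), ('/blog/', '10.0.0.2')])

# Ad-hoc question 1: which pages get the most views?
top_pages = conn.execute(
    'SELECT url, COUNT(*) AS views FROM pageview '
    'GROUP BY url ORDER BY views DESC, url').fetchall()

# Ad-hoc question 2: how many distinct visitors were there?
visitors = conn.execute(
    'SELECT COUNT(DISTINCT ip) FROM pageview').fetchone()[0]

print(top_pages)
print(visitors)
```

Questions like these can be answered without planning for them in advance, which is the main reason for picking a relational store.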

WSGI Server

There are a lot of options to choose from, but my preference is to use gevent. Gevent is a coroutine-based networking library that mixes lightweight threads (greenlets) with libev's event loop. Through the use of some pretty deep monkey-patching, gevent turns your normal, blocking Python code into non-blocking code without any special syntax or APIs (just one big monkeypatch). Gevent's WSGI server, while pretty basic, provides solid performance with very low overhead. As with the database, if you're already running something else or are familiar with a different library, feel free to use that instead.

Creating the virtualenv

Begin by creating a new virtualenv for the analytics app and installing flask and peewee (and optionally, gevent):

    $ virtualenv analytics
    New python executable in analytics/bin/python2
    Also creating executable in analytics/bin/python
    Installing setuptools, pip...done.
    $ cd analytics/
    $ source bin/activate
    $ pip install flask peewee
    ...
    ...
    Successfully installed flask peewee Werkzeug Jinja2 itsdangerous markupsafe
    Cleaning up...
    $ pip install gevent  # Optional.

Implementing the Flask App

Let's start by creating the skeleton of our Flask app. As discussed, there will be two views: one to serve the JavaScript file, and one to serve the 1-pixel GIF. In the analytics directory, create a new file analytics.py and add the following lines of code. This code specifies the boilerplate for our application as well as some configuration values:

    from base64 import b64decode
    import datetime
    import json
    import os
    # Python 2 module; on Python 3 these live in urllib.parse.
    from urlparse import parse_qsl, urlparse

    from flask import Flask, Response, abort, request
    from peewee import *

    # 1 pixel GIF, base64-encoded.
    BEACON = b64decode('R0lGODlhAQABAIAAANvf7wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==')

    # Store the database file in the app directory.
    APP_DIR = os.path.dirname(__file__)
    DATABASE_NAME = os.path.join(APP_DIR, 'analytics.db')
    DOMAIN = 'http://127.0.0.1:5000'  # TODO: change me.

    # Simple JavaScript which will be included and executed on the client-side.
    JAVASCRIPT = ''  # TODO: add javascript implementation.

    # Flask application settings.
    DEBUG = bool(os.environ.get('DEBUG'))
    SECRET_KEY = 'secret - change me'  # TODO: change me.

    app = Flask(__name__)
    app.config.from_object(__name__)

    database = SqliteDatabase(DATABASE_NAME, pragmas={
        'journal_mode': 'wal',  # WAL-mode for better concurrent access.
        'cache_size': -32000})  # 32MB page cache.

    class PageView(Model):
        # TODO: add model definition.

        class Meta:
            database = database

    @app.route('/a.gif')
    def analyze():
        # TODO: implement 1pixel gif view.
        pass

    @app.route('/a.js')
    def script():
        # TODO: implement javascript view.
        pass

    @app.errorhandler(404)
    def not_found(e):
        # Pass the status explicitly so the response is a real 404.
        return Response('Not found.', status=404)

    if __name__ == '__main__':
        database.create_tables([PageView], safe=True)
        app.run()

As you can see we have two simple views: one to serve the JavaScript, and one to analyze the data sent back from the visitor's browser and serve a 1-pixel GIF.
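As a quick sanity check on the BEACON constant, the sketch below decodes the base64 string from the skeleton and inspects the GIF header. It relies only on the standard library; the variable names `signature`, `width` and `height` are just for illustration.

```python
from base64 import b64decode
import struct

BEACON = b64decode(
    'R0lGODlhAQABAIAAANvf7wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==')

# A GIF file starts with a 6-byte signature, followed by the logical
# screen width and height as little-endian unsigned 16-bit integers.
signature = BEACON[:6]
width, height = struct.unpack('<HH', BEACON[6:10])

print(signature)      # b'GIF89a'
print(width, height)  # 1 1
```

Decoding the image once at import time means the view can return the same bytes on every request without touching the disk.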

Getting information from the Browser

Let's begin with the JavaScript that will run on the client-side. This code will extract some basic information from the page:

The URL of the page, including the querystring parameters (document.location.href).

The page's title (document.title).

The referring page's URL, if it exists (document.referrer).

There are other attributes we could also extract, which you can add if you're interested, such as:

The cookie key/value pairs (document.cookie).

The last-modified date of the document (document.lastModified).

And more.

After extracting the information, we will pass it to the analyze view in the query-string. For simplicity, we will have the JavaScript execute immediately once it is loaded by the visitor's browser, so we will wrap everything in a self-invoking anonymous function. Finally, we will use the browser's encodeURIComponent function to make values safe for passing through the query-string:

    (function() {
      var img = new Image,
          url = encodeURIComponent(document.location.href),
          title = encodeURIComponent(document.title),
          ref = encodeURIComponent(document.referrer);
      img.src = '%s/a.gif?url=' + url + '&t=' + title + '&ref=' + ref;
    })();

We've left a placeholder using the Python string interpolation parameter %s to allow our app to pass in the DOMAIN configuration value.

Replace the JAVASCRIPT configuration value in your application file with the following "minified" version of the above JavaScript code:

    # Simple JavaScript which will be included and executed on the client-side.
    JAVASCRIPT = """(function(){
    var d=document,i=new Image,e=encodeURIComponent;
    i.src='%s/a.gif?url='+e(d.location.href)+'&ref='+e(d.referrer)+'&t='+e(d.title);
    })()""".replace('\n', '')

We can now fill in the script view to serve our JavaScript file:

    @app.route('/a.js')
    def script():
        return Response(
            app.config['JAVASCRIPT'] % (app.config['DOMAIN']),
            mimetype='text/javascript')
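To see what the browser actually receives, here is a small stand-alone sketch of the interpolation step. It reuses the minified template from this post; `render_script` is a hypothetical helper name, not part of the app.

```python
JAVASCRIPT = """(function(){
var d=document,i=new Image,e=encodeURIComponent;
i.src='%s/a.gif?url='+e(d.location.href)+'&ref='+e(d.referrer)+'&t='+e(d.title);
})()""".replace('\n', '')


def render_script(domain):
    """Fill the %s placeholder with the analytics server's base URL."""
    return JAVASCRIPT % (domain,)


script = render_script('http://127.0.0.1:5000')
print(script)
```

One thing to keep in mind with this approach: the `%` operator treats every percent sign in the template as a format directive, so a literal `%` added to the JavaScript later would need to be written as `%%`.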

Storing the page-view data

The script we wrote will send three values to the analyze view, containing the page's URL, title, and referring page. We can now fill in the PageView model definition to store this data.

On the server-side, we will also be able to access the visitor's IP address and the request headers sent by the visitor's browser, so we will add columns for those values as well as the timestamp indicating when the request was made.
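When the app runs behind a reverse proxy, the visitor's address typically arrives in the X-Forwarded-For header rather than as the connection's remote address. The sketch below shows one way to pick the address under that assumption. `client_ip` is a hypothetical helper, `headers` can be any mapping with a `get` method, and the header should only be trusted when it is set by a proxy you control, since clients can send it themselves.

```python
def client_ip(headers, remote_addr):
    """Pick the visitor's address, preferring X-Forwarded-For if present."""
    forwarded = headers.get('X-Forwarded-For', '')
    # The header may hold a comma-separated chain: client, proxy1, proxy2.
    first = forwarded.split(',')[0].strip()
    return first or remote_addr


# Sample addresses come from the documentation-only IP ranges.
behind_proxy = client_ip({'X-Forwarded-For': '203.0.113.7, 10.0.0.1'}, '10.0.0.1')
direct = client_ip({}, '198.51.100.4')
print(behind_proxy)
print(direct)
```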

Since each browser may send a different collection of headers, and each page may have a different set of querystring parameters, we will store these as JSON in a TextField. If you're using PostgreSQL, you could also use HStore or the native JSON data-type.

Here is the definition of the PageView model, along with a simple JSONField suitable for storing the query-string parameters and request headers:

    class JSONField(TextField):
        """Store JSON data in a TextField."""
        def python_value(self, value):
            if value is not None:
                return json.loads(value)

        def db_value(self, value):
            if value is not None:
                return json.dumps(value)


    class PageView(Model):
        domain = CharField()
        url = TextField()
        timestamp = DateTimeField(default=datetime.datetime.now, index=True)
        title = TextField(default='')
        ip = CharField(default='')
        referrer = TextField(default='')
        headers = JSONField()
        params = JSONField()

        class Meta:
            database = database
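The two conversion hooks on JSONField can be tried out without a database. The sketch below copies their logic into plain functions (so it runs without peewee installed); the sample headers are made up.

```python
import json


def db_value(value):
    """Serialize a Python object to the text stored in the column."""
    if value is not None:
        return json.dumps(value)


def python_value(value):
    """Deserialize the stored text back into a Python object."""
    if value is not None:
        return json.loads(value)


headers = {'User-Agent': 'ExampleBrowser/1.0', 'Accept': '*/*'}
stored = db_value(headers)
restored = python_value(stored)
print(type(stored).__name__)
print(restored == headers)
```

The `None` checks let a NULL column pass through untouched in both directions instead of being stored as the string "null".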

Now we can add a method to the PageView model which will extract all the relevant values from the request. The urlparse module contains helpful functions for extracting portions of the request, and we will use this to extract the visitor's URL and the querystring parameters:

    class PageView(Model):
        # ... field definitions ...

        @classmethod
        def create_from_request(cls):
            parsed = urlparse(request.args['url'])
            params = dict(parse_qsl(parsed.query))

            return PageView.create(
                domain=parsed.netloc,
                url=parsed.path,
                title=request.args.get('t') or '',
                ip=request.headers.get('X-Forwarded-For', request.remote_addr),
                referrer=request.args.get('ref') or '',
                headers=dict(request.headers),
                params=params)
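To make the parsing step concrete, here is what urlparse and parse_qsl return for a made-up page URL. Note that this sketch imports from urllib.parse, which is where these functions live on Python 3; the app code in this post uses the Python 2 urlparse module.

```python
from urllib.parse import parse_qsl, urlparse

parsed = urlparse('http://example.com/blog/?page=2&q=flask')
params = dict(parse_qsl(parsed.query))

print(parsed.netloc)  # example.com
print(parsed.path)    # /blog/
print(params)         # {'page': '2', 'q': 'flask'}
```

Because the pairs are collected into a dict, a parameter that appears more than once in the querystring keeps only its last value.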

The final step will be to fill in the analyze view. This view will create a new PageView and return a 1-pixel GIF. As a safeguard, we will check for the presence of a URL in the querystring to ensure we don't accidentally create blank rows:

    @app.route('/a.gif')
    def analyze():
        if not request.args.get('url'):
            abort(404)

        with database.transaction():
            PageView.create_from_request()

        response = Response(app.config['BEACON'], mimetype='image/gif')
        response.headers['Cache-Control'] = 'private, no-cache'
        return response
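The request handling in this view can also be sketched without Flask, which makes it easy to poke at from a plain Python session. The following is a framework-free WSGI approximation, not the app's code: `beacon_app` and `call` are hypothetical names, and the database write is replaced by a comment.

```python
from base64 import b64decode
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults

BEACON = b64decode(
    'R0lGODlhAQABAIAAANvf7wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==')


def beacon_app(environ, start_response):
    """Return the GIF when a url parameter is present, else a 404."""
    args = parse_qs(environ.get('QUERY_STRING', ''))
    if not args.get('url'):
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'Not found.']
    # A real app would record the page-view here before responding.
    start_response('200 OK', [
        ('Content-Type', 'image/gif'),
        ('Cache-Control', 'private, no-cache')])
    return [BEACON]


def call(query_string):
    """Invoke the app directly with a fake WSGI environ."""
    environ = {'QUERY_STRING': query_string}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers):
        captured['status'] = status
        captured['headers'] = dict(headers)

    body = b''.join(beacon_app(environ, start_response))
    return captured['status'], captured['headers'], body


print(call('url=http%3A%2F%2Fexample.com%2F')[0])
print(call('')[0])
```

The Cache-Control header matters for the counts: if the browser were allowed to reuse a cached copy of the image, later page-views might never reach the server.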

Running the app

If you'd like to test out the app at this point, you can run it in debug mode by specifying DEBUG=1 on the command-line:

    (analytics) $ DEBUG=1 python analytics.py
     * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
     * Restarting with reloader

You can view the JavaScript by loading http://127.0.0.1:5000/a.js. If you have another web-app running locally, you can add the following tag to one of the pages to test the analytics app:

    <script src="http://127.0.0.1:5000/a.js" type="text/javascript"></script>

To deploy the app to a production environment, I'd suggest looking into using a dedicated WSGI server. I like using gevent because it is extremely lightweight and provides great performance. You can modify analytics.py to serve requests using gevent instead of the Flask server. The following code will run the analytics app on port 5000 using gevent:

    if __name__ == '__main__':
        # gevent.wsgi was removed in gevent 1.3; gevent.pywsgi provides WSGIServer.
        from gevent.pywsgi import WSGIServer
        WSGIServer(('', 5000), app).serve_forever()

Because gevent uses monkey-patching to achieve its high concurrency, it is necessary to add the following line to the very top of the analytics.py file:

    from gevent import monkey; monkey.patch_all()

Querying the Data

The real fun begins after you've started to collect data for a couple days and can run queries on it. In this section we'll look at some interesting ways we can query the data collected by the analytics app.

Using data from my blog, we'll run some queries on the past seven days of traffic:

    >>> from analytics import *
    >>> import datetime
    >>> week_ago = datetime.date.today() - datetime.timedelta(days=7)
    >>> base = PageView.select().where(PageView.timestamp >= week_ago)

First off, let's see how many page-views I got during the past week:

    >>> base.count()
    1133

How many different IPs visited my site?

    >>> base.select(PageView.ip).group_by(PageView.ip).count()
    850

What are the top 10 pages?

    print (base
           .select(PageView.title, fn.Count(PageView.id))
           .group_by(PageView.title)
           .order_by(fn.Count(PageView.id).desc())
           .tuples())[:10]

    # Prints...
    [('Postgresql HStore, JSON data-type and Arrays with Peewee ORM', 88),
     ("Describing Relationships: Django's ManyToMany Through", 73),
     ('Using python and k-means to find the dominant colors in images', 66),
     ('SQLite: Small. Fast. Reliable. Choose any three.', 58),
     ('Using python to generate awesome linux desktop themes', 54),
     ("Don't sweat the small stuff - use flask blueprints", 51),
     ('Using SQLite Full-Text Search with Python', 48),
     ('Home', 47),
     ('Blog Entries', 46),
     ('Django Patterns: Model Inheritance', 44)]

During what four-hour period of the day do I receive the most traffic?

    hour = fn.date_part('hour', PageView.timestamp) / 4
    id_count = fn.Count(PageView.id)
    print (base
           .select(hour, id_count)
           .group_by(hour)
           .order_by(id_count.desc())
           .tuples())[:]

    [(3, 208), (2, 201), (0, 194), (1, 183), (4, 178), (5, 169)]

Based on these numbers, it looks like I get most of my traffic mid-day around lunch-time, and the least amount of traffic in the late evening before midnight, but overall traffic is fairly even.
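The bucket numbers in that result are the hour of the day divided by four, so bucket 3 covers 12:00 through 15:59. The sketch below does the same bucketing in plain Python and turns the indexes into readable ranges. `bucket_label` is a hypothetical helper and the timestamps are made up to stand in for PageView.timestamp values.

```python
from collections import Counter
import datetime


def bucket_label(bucket):
    """Translate a four-hour bucket index (0-5) into a clock range."""
    start = bucket * 4
    return '%02d:00-%02d:59' % (start, start + 3)


# Made-up timestamps standing in for PageView.timestamp values.
timestamps = [
    datetime.datetime(2014, 10, 1, 9, 15),
    datetime.datetime(2014, 10, 1, 12, 30),
    datetime.datetime(2014, 10, 1, 13, 5),
    datetime.datetime(2014, 10, 1, 23, 59),
]

# Integer division groups hours 0-3 into bucket 0, 4-7 into bucket 1, etc.
counts = Counter(ts.hour // 4 for ts in timestamps)
for bucket, count in counts.most_common():
    print(bucket_label(bucket), count)
```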

What are some of the most popular user-agents?

    from collections import Counter
    c = Counter(pv.headers.get('User-Agent') for pv in base)
    print c.most_common(5)

    [(u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36', 81),
     (u'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36', 70),
     (u'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0', 50),
     (u'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2', 37),
     (u'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0', 37)]

It's basically up to you what you want to do with the data. One fun query will generate a list of all the pages, in order, that were visited by a particular IP address. This can shed some light on how people browse your site from page-to-page:

    inner = base.select(PageView.ip, PageView.url).order_by(PageView.timestamp)
    query = (PageView
             .select(PageView.ip, fn.GROUP_CONCAT(PageView.url).alias('urls'))
             .from_(inner.alias('t1'))
             .group_by(PageView.ip)
             .order_by(fn.Count(PageView.url).desc()))
    print {pv.ip: pv.urls.split(',') for pv in query[:10]}

    # Prints something like the following:
    {
        u'xxx.xxx.xxx.xxx': [
            u'/blog/peewee-was-baroque-so-i-rewrote-it/',
            u'/blog/peewee-was-baroque-so-i-rewrote-it/',
            u'/blog/',
            u'/blog/postgresql-hstore-json-data-type-and-arrays-with-peewee-orm/',
            u'/blog/search/',
            u'/blog/the-search-for-the-missing-link-what-lies-between-sql-and-django-s-orm-/',
            u'/blog/how-do-you-use-peewee-/'],
        u'xxx.xxx.xxx.xxx': [
            u'/blog/dont-sweat-small-stuff-use-flask-blueprints/',
            u'/',
            u'/blog/',
            u'/blog/migrating-to-sqlite/',
            u'/blog/',
            u'/blog/saturday-morning-hacks-revisiting-the-notes-app/'],
        u'xxx.xxx.xxx.xxx': [
            u'/blog/using-python-to-generate-awesome-linux-desktop-themes/',
            u'/',
            u'/blog/',
            u'/blog/customizing-google-chrome-s-new-tab-page/',
            u'/blog/-wallfix-using-python-to-set-my-wallpaper/',
            u'/blog/simple-botnet-written-python/'],
        # etc...
    }
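The same per-visitor paths can be built in plain Python, which sidesteps two caveats of the SQL version: SQLite's documentation describes the element order of group_concat() as arbitrary, and splitting on commas breaks for URLs that contain one. This is an alternative sketch rather than the query used in this post; the rows are made up, and the integer third column stands in for a timestamp.

```python
from collections import defaultdict

# Made-up (ip, url, timestamp) rows; real rows would come from PageView.
rows = [
    ('10.0.0.1', '/blog/', 2),
    ('10.0.0.2', '/', 1),
    ('10.0.0.1', '/', 1),
    ('10.0.0.1', '/blog/some-post/', 3),
]

paths = defaultdict(list)
# Sorting by timestamp first keeps each visitor's pages in visit order.
for ip, url, _ in sorted(rows, key=lambda row: row[2]):
    paths[ip].append(url)

# Show the visitors with the longest paths first.
for ip in sorted(paths, key=lambda ip: len(paths[ip]), reverse=True):
    print(ip, paths[ip])
```

The trade-off is that every row is pulled into memory, which is fine for a week of traffic on a small blog but would not scale to a large data-set.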

Ideas for improving the app

Build a web interface or API for querying the pageview data-set.

Normalize the request headers using either a join table or something like PostgreSQL HStore (or JSONB if you're using 9.4).

Collect user cookies and track users between visits.

Use a GeoIP tool to identify users' locations based on their IP.

Implement canvas fingerprinting to better identify unique visitors.

Write more cool queries to extract data about your audience!

Thanks for reading

Thanks for taking the time to read this post. I hope you found it interesting! Feel free to leave a comment below or contact me if you have any questions.

You can find the source code for the analytics app and the "reports" hosted in this GitHub gist.

Links

If you enjoyed this post and are looking for more projects like this, check out the list of saturday-morning hack posts.
