Django 2.0 was released recently and among the most exciting things for me is support for Window expressions, which allows adding an OVER clause to querysets. We will use Window expressions to analyze the commits data to the Django repo.

So what is an over clause?

An over clause is of this format

SELECT depname , empno , salary , avg ( salary ) OVER ( PARTITION BY depname ) FROM empsalary ;

Compare this to a similar GROUP BY statement

SELECT depname , avg ( salary ) FROM empsalary GROUP BY depname ;

The difference is a GROUP BY has as many rows as grouping elements, here number of depname. An over clause adds the the aggregated result to each row of the select.

Postgres documentation says, "A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result." This is true for all other DB implementation as well.

What are real world uses of over expressions?

We will use the Django ORM with the Window expression to to some analysis on the most prolific committers to Django. To do this we will export the commiter names and time of commit to a csv.

git log --no-merges --date = iso --pretty = format: '%h|%an|%aI' > commits.iso.csv

This is not ranking of Django developers, just of their number of commits, which allows us an interestig dataset. I am grateful to everyone who has contributed to Django - they have made my life immesureably better.

With some light data wrangling using Pandas, we transform this to a per author, per year data and import to Postgres. Our table structure looks like this.

experiments =# \ d commits_by_year ; Table "public.commits_by_year" Column | Type | Modifiers ---------------+---------+----------- id | bigint | author | text | commit_year | integer | commits_count | integer |

We define a model to interact with this table.

from django.db import models class Committer ( models . Model ): author = models . CharField ( max_length = 100 ) commit_year = models . PositiveIntegerField () commits_count = models . PositiveIntegerField () class Meta : db_table = 'commits_by_year'

Lets quickly test if our data is imported. You can get a csv from here, and import to Postgres to follow along.

In [ 2 ]: Committer . objects . all (). count () Out [ 2 ]: 2318

Let us setup our environment and get the imports we need.

## Some ORM imports which we are going to need from django.db.models import Avg , F , Window from django.db.models.functions import Rank , DenseRank , CumeDist from django_commits.models import Committer # We will use pandas to display the queryset in tanular format import pandas pandas . options . display . max_rows = 20 # An utility function to display querysets def as_table ( values_queryset ): return pandas . DataFrame ( list ( values_queryset ))

Lets quickly look at the data we have.

as_table ( Committer . objects . all (). values ( "author" , "commit_year" , "commits_count" ))

author commit_year commits_count 0 Tim Graham 2017 373 1 Sergey Fedoseev 2017 158 2 Mariusz Felisiak 2017 113 3 Claude Paroz 2017 102 4 Mads Jensen 2017 55 5 Simon Charette 2017 40 6 Jon Dufresne 2017 33 7 Anton Samarchyan 2017 27 8 François Freitag 2017 17 9 Srinivas Reddy Thatiparthy 2017 14 ... ... ... ... 2308 Malcolm Tredinnick 2006 175 2309 Georg Bauer 2006 90 2310 Russell Keith-Magee 2006 86 2311 Jacob Kaplan-Moss 2006 83 2312 Luke Plant 2006 20 2313 Wilson Miner 2006 12 2314 Adrian Holovaty 2005 1015 2315 Jacob Kaplan-Moss 2005 130 2316 Georg Bauer 2005 112 2317 Wilson Miner 2005 20 2318 rows × 3 columns

We will now use the Window expression to get the contributors ranked by number of commits, within each year. We will go over the code in detail, but lets look at the queryset and results.

# Find out who have been the most prolific contributors # in the years 2010-2017 dense_rank_by_year = Window ( expression = DenseRank (), partition_by = F ( "commit_year" ), order_by = F ( "commits_count" ). desc () ) commiters_with_rank = Committer . objects . filter ( commit_year__gte = 2010 , commits_count__gte = 10 ). annotate ( the_rank = dense_rank_by_year ). order_by ( "-commit_year" , "the_rank" ). values ( "author" , "commit_year" , "commits_count" , "the_rank" ) as_table ( commiters_with_rank )

author commit_year commits_count the_rank 0 Tim Graham 2017 373 1 1 Sergey Fedoseev 2017 158 2 2 Mariusz Felisiak 2017 113 3 3 Claude Paroz 2017 102 4 4 Mads Jensen 2017 55 5 5 Simon Charette 2017 40 6 6 Jon Dufresne 2017 33 7 7 Anton Samarchyan 2017 27 8 8 François Freitag 2017 17 9 9 Srinivas Reddy Thatiparthy 2017 14 10 ... ... ... ... ... 171 Joseph Kocherhans 2010 53 11 172 Ramiro Morales 2010 53 11 173 Jacob Kaplan-Moss 2010 42 12 174 Chris Beaven 2010 29 13 175 Malcolm Tredinnick 2010 26 14 176 Honza Král 2010 20 15 177 Carl Meyer 2010 17 16 178 Ian Kelly 2010 17 16 179 Simon Meers 2010 11 17 180 Gary Wilson Jr 2010 10 18 181 rows × 4 columns

Lets look a the the ORM code in more detail here.

# We are creating the Window function part of our SQL query here dense_rank_by_year = Window ( # We want to get the Rank with no gaps expression = DenseRank (), # We want to partition the queryset on commit_year # Each distinct commit_year is a different partition partition_by = F ( "commit_year" ), # This decides the ordering within each partition order_by = F ( "commits_count" ). desc () ) commiters_with_rank = Committer . objects . filter ( commit_year__gte = 2010 , commits_count__gte = 10 # Standard filter oprtation, limit rows to 2010-2017 ). annotate ( # For each commiter, we are annotating its rank the_rank = dense_rank_by_year ). order_by ( "-commit_year" , "the_rank" ). values ( "author" , "commit_year" , "commits_count" , "the_rank" ) as_table ( commiters_with_rank )

Now lets try getting the average commits per commiter for each year along with the other data.

avg_commits_per_year = Window ( # We want the average of commits per committer, with each partition expression = Avg ( "commits_count" ), # Each individual year is a partition. partition_by = F ( "commit_year" ) ) commiters_with_yearly_average = Committer . objects . filter (). annotate ( avg_commit_per_year = avg_commits_per_year ). values ( "author" , "commit_year" , "commits_count" , "avg_commit_per_year" ) # We could have done further operation with avg_commit_per_year # Eg: F(commits_count) - F(avg_commit_per_year), # would tell us committers who commit more than average as_table ( commiters_with_yearly_average )

This gives us

author avg_commit_per_year commit_year commits_count 0 Wilson Miner 319.250000 2005 20 1 Adrian Holovaty 319.250000 2005 1015 2 Jacob Kaplan-Moss 319.250000 2005 130 3 Georg Bauer 319.250000 2005 112 4 Russell Keith-Magee 188.571429 2006 86 5 Jacob Kaplan-Moss 188.571429 2006 83 6 Luke Plant 188.571429 2006 20 7 Wilson Miner 188.571429 2006 12 8 Adrian Holovaty 188.571429 2006 854 9 Malcolm Tredinnick 188.571429 2006 175 ... ... ... ... ... 2308 Adam Johnson 4.916084 2017 13 2309 Tom 4.916084 2017 13 2310 Srinivas Reddy Thatiparthy 4.916084 2017 14 2311 François Freitag 4.916084 2017 17 2312 Anton Samarchyan 4.916084 2017 27 2313 Jon Dufresne 4.916084 2017 33 2314 Simon Charette 4.916084 2017 40 2315 Mads Jensen 4.916084 2017 55 2316 Claude Paroz 4.916084 2017 102 2317 Mariusz Felisiak 4.916084 2017 113 2318 rows × 4 columns

You could try other Window functions such as CumeDist, Rank or Ntile.

from django.db.models.functions import CumeDist cumedist_by_year = Window ( expression = CumeDist (), partition_by = F ( "commit_year" ), order_by = F ( "commits_count" ). desc () ) commiters_with_rank = Committer . objects . filter ( commit_year__gte = 2010 , commits_count__gte = 10 ). annotate ( cumedist_by_year = cumedist_by_year ). order_by ( "-commit_year" , "the_rank" ). values ( "author" , "commit_year" , "commits_count" , "cumedist_by_year" ) as_table ( commiters_with_rank )

Until now, we have partitioned on commit_year. We can partition on other fields too. We will partition on author to find out how their contributions have changed over the years using the Lag window expression.

from django.db.models.functions import Lag from django.db.models import Value commits_in_previous_year = Window ( expression = Lag ( "commits_count" , default = Value ( 0 )), partition_by = F ( "author" ), order_by = F ( "commit_year" ). asc (), ) commiters_with_pervious_year_commit = Committer . objects . filter ( commit_year__gte = 2010 , commits_count__gte = 10 ). annotate ( commits_in_previous_year = commits_in_previous_year ). order_by ( "author" , "-commit_year" ). values ( "author" , "commit_year" , "commits_count" , "commits_in_previous_year" ) as_table ( commiters_with_pervious_year_commit )

author commit_year commits_count commits_in_previous_year 0 Adam Chainz 2016 42 12 1 Adam Chainz 2015 12 0 2 Adam Johnson 2017 13 0 3 Adrian Holovaty 2012 40 98 4 Adrian Holovaty 2011 98 72 5 Adrian Holovaty 2010 72 0 6 Akshesh 2016 31 0 7 Alasdair Nicol 2016 13 19 8 Alasdair Nicol 2015 19 17 9 Alasdair Nicol 2013 17 0 ... ... ... ... ... 171 Timo Graham 2012 13 70 172 Timo Graham 2011 70 60 173 Timo Graham 2010 60 0 174 Tom 2017 13 0 175 Unai Zalakain 2013 17 0 176 Vajrasky Kok 2013 14 0 177 areski 2014 15 0 178 eltronix 2016 10 0 179 wrwrwr 2014 21 0 180 Łukasz Langa 2013 15 0 181 rows × 4 columns

I hope this tutorial has been helpful in understanding the window expressions. While still not as felxible as SqlAlchemy, Django ORM has become extremely powerful with recent Django releases. Stay tuned for more advanced ORM tutorials.

Thank you for reading the Agiliq blog. This article was written by shabda on Dec 6, 2017 in django . You can subscribe ⚛ to our blog. We love building amazing apps for web and mobile for our clients. If you are looking for development help, contact us today ✉.