I made a simple dashboard for visualizing marshmallow PyPI downloads.

Why

While maintaining marshmallow, I found myself running one-off queries against PyPI's BigQuery dataset to answer some recurring questions:

How many marshmallow users are using Python 2 vs Python 3?

Can we drop support for a minor Python version without pissing off too many users?

Are users migrating to marshmallow 3?

How long should we support old marshmallow versions?

Executing and modifying BigQuery queries by hand quickly became tedious.

How

I run a daily scheduled query on BigQuery that subsets the PyPI dataset with the specific data that I need. This keeps my BigQuery usage well within the limits of the free tier.

BigQuery's scheduled queries UI

The query starts with a CTE that selects the date, Python major and minor versions, marshmallow major and minor versions, and platform.

WITH dls AS ( SELECT DATE_SUB ( DATE ( @ run_time ), INTERVAL 1 DAY ) AS date , file . project AS package , details . installer . name AS installer , -- Full python version details . python AS python_version , -- Python major version CAST ( SPLIT ( details . python , '.' )[ OFFSET ( 0 )] AS string ) AS python_major , -- Python minor version CAST ( CONCAT ( SPLIT ( details . python , '.' )[ OFFSET ( 0 )], '.' , SPLIT ( details . python , '.' )[ OFFSET ( 1 )]) AS string ) AS python_minor , -- Full marshmallow version file . version AS marshmallow_version , -- marshmallow major version CAST ( SPLIT ( file . version , '.' )[ OFFSET ( 0 )] AS string ) AS marshmallow_major , -- Platform details . system . name AS system FROM ` the - psf . pypi . downloads *` WHERE -- Past day _TABLE_SUFFIX = FORMAT_DATE ( '%Y%m%d' , DATE_SUB ( DATE ( @ run_time ), INTERVAL 1 DAY )) AND file . project = 'marshmallow' -- Exclude mirrors AND details . installer . name NOT IN ( "bandersnatch" , "z3c.pypimirror" , "Artifactory" , "devpi" ) AND details . python IS NOT NULL ) -- ...

Then the query groups the download counts by category:

Python major version

Python minor version

marshmallow major version

marshmallow minor version

Python minor version x marshmallow major version

-- WITH dls AS ... -- Python 2 vs 3 SELECT date , 'python_major' AS category_label , python_major AS category_value , COUNT ( * ) AS downloads FROM dls GROUP BY date , package , category_value UNION ALL -- Python minor versions SELECT date , 'python_minor' AS category_label , python_minor AS category_value , COUNT ( * ) AS downloads FROM dls GROUP BY date , package , category_value UNION ALL -- marshmallow major version SELECT date , 'marshmallow_major' AS category_label , marshmallow_major AS category_value , COUNT ( * ) AS downloads FROM dls WHERE marshmallow_major IN ( '2' , '3' ) GROUP BY date , package , category_value UNION ALL -- marshmallow minor version SELECT date , 'marshmallow_version' AS category_label , marshmallow_version AS category_value , COUNT ( * ) AS downloads FROM dls WHERE marshmallow_major IN ( '2' , '3' ) GROUP BY date , package , category_value UNION ALL -- Python minor version x marshmallow major version SELECT date , 'combined' AS category_label , CAST ( CONCAT ( 'py' , python_minor , '-' , 'marshmallow' , marshmallow_major ) AS string ) AS category_value , COUNT ( * ) AS downloads FROM dls WHERE marshmallow_major IN ( '2' , '3' ) GROUP BY date , package , category_value

There is no way to directly control for downloads on CI servers. One solution is to exclude downloads on Linux and assume that downloads on Windows and macOS machines are a representative sample of local development installations.

To achieve this, the query includes duplicates of the above SELECT s with an additional WHERE system != "Linux" clause.

-- ... Same as above, excluding Linux downloads SELECT date , 'python_major' AS category_label , CONCAT ( python_major , '-' , 'no_linux' ) AS category_value , COUNT ( * ) AS downloads FROM dls WHERE system != "Linux" GROUP BY date , package , category_value UNION ALL -- and so on ...

The full query is here. The resulting dataset looks like this:

Dataset produced by above query

After backfilling the dataset, I wrote an app to query the dataset and visualize it with Dash, a Python framework for creating interactive visualizations. With pandas at hand to take care of slicing and transforming the dataset, writing the visualization code in Python turned out to be a breeze.

The source code for the Dash app is here.

Takeaways

The majority of marshmallow users are using Python 3.6 and Python 3.7. Python 3.5 usage is low (even lower than Python 2).

Users upgrade to new versions quickly. marshmallow users install the latest point versions of the supported release lines as they're released.

Don't expect users to use pre-releases. marshmallow 3 was "release-candidate stable" for several months, and we recommended using v3 for new projects. Even so, v3 remained at < 20% until the stable release.

Links