What platform do my users run on? How many users do I have? Is it okay to drop support for something?

Open source library developers have an analytics problem. How do you answer questions like this when people are embedding your software as a component of their own product?

For example, you, the developer of the library frinkiac-iac would like to drop support for Python 2.6 because it is old, unsupported by upstream, and restricting your use of newer Python features. Is it okay to do this? Since you don’t have any data you may attempt to compensate for this lack of insight in a variety of ways:

Reach out through your community (mailing lists, bug trackers, IRC channels) to tentatively suggest the idea.

Do a release with a deprecation warning with a timeline for removal that will potentially be massively pushed back as users complain.

Just remove it and see if anybody complains.

Wait for a major package to do it and follow their lead.

Each of these choices can work, but frequently result in major breakage for users or an unacceptably long timeline for removal resulting in significant maintenance burden. You need data, but it’s so hard to collect it. Can’t someone else do it?

BigQuery

Quietly announced back in May 2016, PyPI download statistics are the most powerful tool a Python developer can use to inform their decision making. To unleash its awesome power start by loading the dataset. Preemptive apology: If you are a new Google Cloud user you may be confronted with an unpleasant set of modals asking you to agree to various things and create a new project. What you actually need to do is get to billing and start a free trial. Once you’ve got the BigQuery console available you should be able to click that link again and access the page. Now you can click Compose Query!

BigQuery uses a SQL-like language. A query reference is available to help you write your own queries. Additionally, you can see the set of available fields you can query against by clicking downloads on the left pane and looking at the schema. One note: file.project is the normalized name of the project.

With this robust data set we can make substantially more informed decisions, but here are some examples to help you out. All the examples here use what Google references as “legacy SQL”. Despite the name this is the default for queries, but feel free to use the standard SQL if you’d like!

Query: Python Versions

In the original scenario we were curious if we could drop support for Python 2.6. What does that query look like? For the following examples we’ll be using cryptography since that’s my primary project and the one I run queries on most commonly.

SELECT REGEXP_EXTRACT ( details . python , r "^([^ \. ]+ \. [^ \. ]+)" ) as python_version , COUNT ( * ) as download_count , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) WHERE file . project = 'cryptography' GROUP BY python_version , ORDER BY download_count DESC LIMIT 100

And our results (as of 2016-12-09):

Row python_version download_count 1 2.7 2208835 2 3.5 202515 3 2.6 109134 4 null 91076 5 3.4 87837 6 3.3 5623 7 3.6 1354 8 3.7 643 9 1.17 341 10 3.2 270 11 3.1 2 12 2.4 1 13 2.5 1

As you can see, Python 2.6 makes up 109,134 out of 2,707,632 downloads in the past 30 days. This represents roughly 4% of downloads. Is that low enough to drop support?.

null also makes up approximately 3.4% of downloads. These are downloads from PyPI using clients that do not support sending the statistics we’re querying against. This can be an older version of pip or alternate clients. You also see 341 downloads from 1.17, which is…who knows! When making maintenance decisions you should factor these unknowns as you feel appropriate.

Query: OpenSSL versions

cryptography supports a wide variety of OpenSSL versions. However, supporting 0.9.8 and 1.0.0 are a significant challenge since they are missing many of the features we need (and are no longer supported by upstream). Let’s craft a query to see what versions of OpenSSL are in use:

SELECT details . system . name , REGEXP_EXTRACT ( details . openssl_version , r "^OpenSSL ([^ ]+) " ) as openssl_version , COUNT ( * ) as download_count , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) WHERE details . openssl_version IS NOT NULL GROUP BY details . system . name , openssl_version , HAVING download_count >= 100 ORDER BY download_count DESC LIMIT 100

This gives us a view of what OpenSSL is linked against Python broken down by platform:

Row details_system_name openssl_version download_count 1 Linux 1.0.1f 56152115 2 Linux 1.0.1t 26301089 3 Linux 1.0.2g 25892721 4 Linux 1.0.1e-fips 24082526 5 Linux 1.0.1 19199858 6 Darwin 0.9.8zh 15528451 7 Linux 1.0.1k-fips 9029119 8 Linux 1.0.2j 8499043 9 Windows 1.0.2h 5037984 10 Darwin 1.0.2j 3077958

While I haven’t provided the entire set of results it turns out less than 100,000 downloads out of 210,063,137 were made using OpenSSL 1.0.0. 0.9.8 holds a much greater share, but only due to Darwin (aka macOS…aka OS X). In cryptography’s case we statically link wheels on Mac and Windows so we can ignore the OpenSSL version on those platforms. Looks like dropping 0.9.8 and 1.0.0 is probably safe!

Query: Most Popular Projects

Maybe you just want to know how popular your package is relative to others in the past 30 days.

SELECT file . project , COUNT ( * ) as total_downloads , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) GROUP BY file . project ORDER BY total_downloads DESC LIMIT 100

Query: How Many Downloads Did My Project Get?

Or maybe just your package:

SELECT COUNT ( * ) as total_downloads , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) WHERE file . project = 'cryptography'

Query: Downloads By Filename Filtered By Project and Version

If you ship multiple release artifacts (platform/version specific wheels as well as sdist) this query can show you their popularity.

SELECT file . filename , COUNT ( * ) as total_downloads , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) WHERE file . project = 'cryptography' AND file . filename like '%1.6%' GROUP BY file . filename ORDER BY total_downloads DESC LIMIT 100

Query: Python 2 vs 3 For A Single Project

We broke it down by Python release previously, but what if you just want to know 2 vs 3?

SELECT ROUND ( 100 * SUM ( CASE WHEN REGEXP_EXTRACT ( details . python , r "^([^ \. ]+)" ) = "3" THEN 1 ELSE 0 END ) / COUNT ( * ), 1 ) AS percent_3 , COUNT ( * ) as download_count , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) WHERE file . project = 'cryptography' ORDER BY download_count DESC LIMIT 100

In cryptography’s case Python 3 currently makes up 11% of downloads.

Query: Percentage of Downloads By Python 3 In The Top 100

Of course, a trivial modification and we can see the Python 3 percentage in the top 100 packages:

SELECT file . project , ROUND ( 100 * SUM ( CASE WHEN REGEXP_EXTRACT ( details . python , r "^([^ \. ]+)" ) = "3" THEN 1 ELSE 0 END ) / COUNT ( * ), 1 ) AS percent_3 , COUNT ( * ) as download_count , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) group by file . project ORDER BY download_count DESC LIMIT 100

Query: Highest Python 3 Usage With More Than 100,000 Downloads Per 30 Days

SELECT file . project , ROUND ( 100 * SUM ( CASE WHEN REGEXP_EXTRACT ( details . python , r "^([^ \. ]+)" ) = "3" THEN 1 ELSE 0 END ) / COUNT ( * ), 1 ) AS percent_3 , COUNT ( * ) as download_count , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) group by file . project HAVING download_count > 100000 ORDER BY percent_3 DESC LIMIT 100

The top 10 is interesting here:

Row Date Percentage Total Downloads 1 async-timeout 98.8 117724 2 multidict 89.4 157661 3 yarl 82.1 118081 4 aiohttp 74.6 209035 5 azure-servicebus 72.3 151706 6 plumbum 60.1 131295 7 mysqlclient 57.9 140905 8 pkginfo 56.2 102214 9 azure-nspkg 55.2 214821 10 azure-storage 54.7 213129

Query: Which Packages By Percentage Are Downloaded Most Often Via Python 2.6?

SELECT file . project , ROUND ( 100 * SUM ( CASE WHEN REGEXP_EXTRACT ( details . python , r "^([^ \. ]+ \. [^ \. ]+)" ) = "2.6" THEN 1 ELSE 0 END ) / COUNT ( * ), 1 ) AS percent_26 , SUM ( CASE WHEN REGEXP_EXTRACT ( details . python , r "^([^ \. ]+ \. [^ \. ]+)" ) = "2.6" THEN 1 ELSE 0 END ) as total_26_downloads , COUNT ( * ) as total_downloads , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) WHERE details . python IS NOT NULL AND # Exclude things which are stdlib backports * for * Python 2 . 6 file . project NOT IN ( "argparse" , "ordereddict" ) GROUP BY file . project , HAVING total_downloads > 5000 ORDER BY percent_26 DESC LIMIT 250

Query: Most Used Installers/Versions

SELECT details . installer . name , details . installer . version , COUNT ( * ) as total_downloads FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) GROUP BY details . installer . name , details . installer . version ORDER BY total_downloads DESC LIMIT 100

Query: Downloads By Country

SELECT country_code , COUNT ( * ) as downloads , FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 31 , "day" ), DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "day" ) ) GROUP BY country_code ORDER BY downloads DESC LIMIT 100

Query: Python 3 Download Percentage Across PyPI Grouped By Month

SELECT STRFTIME_UTC_USEC ( timestamp , "%Y-%m" ) AS yyyymm , ROUND ( 100 * SUM ( CASE WHEN REGEXP_EXTRACT ( details . python , r "^([^ \. ]+)" ) = "3" THEN 1 ELSE 0 END ) / COUNT ( * ), 1 ) AS percent_3 , COUNT ( * ) as download_count FROM TABLE_DATE_RANGE ( [ the - psf : pypi . downloads ], DATE_ADD ( CURRENT_TIMESTAMP (), - 1 , "year" ), CURRENT_TIMESTAMP () ) group by yyyymm ORDER BY yyyymm DESC LIMIT 100

This is an interesting query for obvious reasons, but also one that processes quite a bit of data. Here are the results through January 2017:

Date Percentage Total Downloads 2017-01 10.3 619254519 2016-12 9.6 537529176 2016-11 9.2 563208796 2016-10 8.9 528112583 2016-09 7.9 538323554 2016-08 7.4 542956355 2016-07 7.1 483143293 2016-06 7.1 428402893 2016-05 7.1 137138123 2016-03 6.6 63615708 2016-02 6.5 314275153 2016-01 6.4 98290315

Data ingestion into the BigQuery data set was spotty prior to June 2016, but you can see a significant uptick in Python 3 based downloads over 2016. If these trends continue…

Decisions

These queries just scratch the surface of the data you can get about the Python ecosystem at large and your project in particular. Write your own queries and make your decisions about platform, version, and implementation with real information.

Of course, this doesn’t address more granular usage, exception tracking, and other analytics web and mobile devs frequently take for granted, but that’s a separate blog post.

Thanks to Donald Stufft for his tireless efforts on PyPI, Alex Gaynor for several of the queries in this blog post, and Google for generously donating capacity on BigQuery for PyPI data.