Now this looks better: Once we exclude Python 3.6, we see a healthy adoption curve for PySpark — dominated by Python 2.7 installations.

This shows that having access to the raw data makes our numbers far more insightful than getting them pre-aggregated. You might have noticed that we had to query 200 GB of data each time; at that rate, just five queries drain our monthly free terabyte.

Let’s use the power of clustered tables to improve that.

Same query over a clustered table

Let’s rewrite that last query over a clustered table:

SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
  , REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python
  , COUNT(*) downloads
FROM `fh-bigquery.pypi.pypi_2017`
WHERE project='pyspark'
AND timestamp>'2000-01-01' # nag
GROUP BY week, python
HAVING python != '3.6' AND week<'2017-12-30'
ORDER BY week

5.4 sec elapsed, 9.65 GB processed # winning

That’s cool: if you go to my clustered tables instead of the existing official ones, the same query will scan only ~5% of the data in about half the time.

Things to notice:

FROM `fh-bigquery.pypi.pypi_2017` is where you can find my clustered tables. I’m creating one per year.

FROM `fh-bigquery.pypi.pypi_20*` allows you to match every year (see the sketch after this list).

WHERE project='pyspark' allows you to benefit from clustering, which can prune by project name in these tables.

file.project on the original tables had to be replaced by project, as clustering doesn’t work over nested columns (yet).

timestamp>'2000-01-01' is my way to forcefully remind you that these are time-partitioned tables. You’ll query proportionally less data if you limit your queries to a portion of the year, or you can explicitly ask for everything by starting at a super-early date.

REGEXP_EXTRACT(details.python, r'^\d*\.\d*') keeps the Python versions down to the most significant digits.

These queries will only run with #standardSQL, as #legacySQL doesn’t support clustered tables.
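Putting those pieces together, here is a sketch of the same weekly breakdown across every yearly table at once, via the wildcard. I’m assuming each yearly table shares the pypi_2017 schema, so treat it as illustrative rather than a benchmarked query:

#standardSQL
SELECT TIMESTAMP_TRUNC(timestamp, WEEK) week
  , REGEXP_EXTRACT(details.python, r'^\d*\.\d*') python
  , COUNT(*) downloads
FROM `fh-bigquery.pypi.pypi_20*`  -- wildcard matches each yearly clustered table
WHERE project='pyspark'  -- clustering can prune by project name
AND timestamp>'2000-01-01'  -- satisfies the required partition filter
GROUP BY week, python
ORDER BY week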

I say “my clustered tables” as I’m offering them as an unofficial place to get these logs while we improve this process. With your feedback and comments we’ll work to incorporate these improvements into the official repository.

Worklog

Recluster

To re-cluster each year of existing data I did something like:



CREATE TABLE `fh-bigquery.pypi.pypi_2017`
PARTITION BY DATE(timestamp)
CLUSTER BY project, country_code
OPTIONS(
  description="repartitioned and clustered from https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20180906"
  , require_partition_filter=true
)
AS
SELECT file.project project
  , STRUCT(file.filename, file.type, file.version) file
  , * EXCEPT(file)
FROM `the-psf.pypi.downloads2017*`

34 min elapsed, 2.44 TB processed
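To double-check how a table ended up partitioned and clustered, you can query its metadata. A minimal sketch using INFORMATION_SCHEMA.COLUMNS (a feature that may postdate this post, so consider it an assumption about the BigQuery you’re running on):

#standardSQL
SELECT column_name, is_partitioning_column, clustering_ordinal_position
FROM `fh-bigquery.pypi.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'pypi_2017'
AND (is_partitioning_column = 'YES' OR clustering_ordinal_position IS NOT NULL)
-- expect timestamp as the partitioning column, then project (1) and country_code (2)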

Daily updates

Every day at 3am I’ll run a scheduled query that adds the previous day’s worth of data:

INSERT INTO `fh-bigquery.pypi.pypi_2018` (project, file, timestamp, country_code, url, details, tls_protocol, tls_cipher)
SELECT file.project project
  , STRUCT(file.filename, file.type, file.version) file
  , * EXCEPT(file)
FROM `the-psf.pypi.downloads2018*`
WHERE _TABLE_SUFFIX = FORMAT_TIMESTAMP('%m%d', TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR))
AND timestamp > ( # for idempotency
  SELECT MAX(timestamp)
  FROM `fh-bigquery.pypi.pypi_2018`
  WHERE timestamp>TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24*2 HOUR)
)

48.9 sec elapsed, 24.83 GB processed
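To sanity-check that a day’s load actually landed, a quick count per day over the recent partitions does it. A minimal sketch; the three-day window is arbitrary:

#standardSQL
SELECT DATE(timestamp) day, COUNT(*) downloads
FROM `fh-bigquery.pypi.pypi_2018`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24*3 HOUR)  -- these tables require a partition filter
GROUP BY day
ORDER BY day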

Configuring a scheduled query

FAQ

BigQuery says it will process a lot of gigabytes, and I expected less

With clustered tables BigQuery gives an estimate of the maximum billable bytes before running a query — but might query way less than that if possible.

Some examples:

SELECT *
FROM `fh-bigquery.pypi.pypi_2017`
WHERE timestamp>='2017-12-01'
LIMIT 1

1.6s elapsed, 28.6 MB processed
# Stops scanning when it finds the first row

SELECT *
FROM `fh-bigquery.pypi.pypi_2017`
WHERE timestamp>='2017-12-01'
AND project='rpy2'
LIMIT 1

2.0s elapsed, 386 MB processed
# Finds the cluster that contains 'rpy2' and scans that

SELECT *
FROM `fh-bigquery.pypi.pypi_2017`
WHERE timestamp>='2017-12-01'
AND project='88888'
LIMIT 1

2.4s elapsed, 4.58 GB processed
# Finds the clusters that could contain '88888', but it's not there

SELECT *
FROM `fh-bigquery.pypi.pypi_2017`
WHERE timestamp>='2017-12-01'
AND project LIKE '%random%'
LIMIT 1

1.8s elapsed, 171 MB processed
# Opens all the clusters, until one contains a *random* project

SELECT *
FROM `fh-bigquery.pypi.pypi_2017`
WHERE timestamp>='2017-12-01'
AND project LIKE '%DOESNT-EXIST%'
LIMIT 1

2.7s elapsed, 221 GB processed
# Opens all the clusters, and keeps searching for a non-existent pattern
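And if you want to see what a query actually billed after it ran, newer BigQuery releases expose job metadata through INFORMATION_SCHEMA. A sketch, assuming your jobs run in the US region and that this view is available to you (it may postdate this post):

#standardSQL
SELECT job_id, total_bytes_processed, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)  -- filter on creation_time keeps the scan small
ORDER BY creation_time DESC
LIMIT 10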

Acknowledgements

Thanks Donald Stufft, for making all of this possible.

Welcoming di_codes to the GCP team.

Next steps

We should move these clustered tables into the official repo — but let’s discuss them first:

Python distutils-sig@ discussion thread for this topic.

Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.