BigQuery is generally praised for its throughput in executing queries; however, most users have come to hate how long it takes to retrieve large query results. Several years, Stack Overflow questions, and GitHub issues later, Google is tackling the problem with the BigQuery Storage API.

The new API doesn’t fully replace the BigQuery API; rather, it augments the classic API: instead of returning results as paginated text, it retrieves them as a continuous stream of serialized binary data.

A quick benchmark shows that the BigQuery Storage API can be tens of times faster than the classic BigQuery API, although the benefit only materializes above a certain result size. In this rough benchmark, the new API started to outperform the classic one somewhere between 100,000 and 1 million returned rows (roughly 3.8 MB to 38 MB). For example, excluding the actual query execution time, retrieving 10 million rows was almost 20 times faster.

The pandas developers have made using the new API dead simple. Even those familiar with pandas.read_gbq might not spot the difference at first: it’s as small as passing the argument use_bqstorage_api=True.
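As a minimal sketch, assuming pandas ≥ 0.25 with the pandas-gbq and google-cloud-bigquery-storage packages installed and default Google Cloud credentials (the project ID and query here are placeholders), the switch looks like this:

```python
import pandas as pd

sql = """
    SELECT name, number
    FROM `bigquery-public-data.usa_names.usa_1910_current`
"""

# Classic API: results are paged back through the BigQuery REST endpoint.
df_classic = pd.read_gbq(sql, project_id="your-project-id")

# Storage API: same query, but results arrive as a binary stream.
df_fast = pd.read_gbq(sql, project_id="your-project-id", use_bqstorage_api=True)
```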

Behind the pandas abstraction, the query is still executed through the classic API’s Python client; only the results are fetched using a client for the new API.
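Roughly, this is what happens under the hood. The sketch below assumes google-cloud-bigquery (≥ 1.11 or so) alongside the beta Storage API client; the project ID and query are placeholders:

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

# Both clients share the same default credentials; the project ID is a placeholder.
bq_client = bigquery.Client(project="your-project-id")
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()

sql = """
    SELECT name, number
    FROM `bigquery-public-data.usa_names.usa_1910_current`
"""

# The query itself still runs through the classic BigQuery client...
query_job = bq_client.query(sql)

# ...but the finished result set is downloaded via the Storage API client.
df = query_job.to_dataframe(bqstorage_client=bqstorage_client)
```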

The Python BigQuery Storage API library can also be used to load table contents directly into memory. Below is a basic example in which only a subset of columns and rows is retrieved. In practice, it could be extended to read data across multiple streams in parallel, apply table modifiers, and more.
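A sketch of such a direct read, following the pattern documented for the beta (v1beta1) client; the billing project is a placeholder and the table comes from a public dataset:

```python
from google.cloud import bigquery_storage_v1beta1

client = bigquery_storage_v1beta1.BigQueryStorageClient()

# The table to read: a public dataset, used here purely for illustration.
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "usa_names"
table_ref.table_id = "usa_1910_current"

# Restrict the read to a subset of columns and rows.
read_options = bigquery_storage_v1beta1.types.TableReadOptions()
read_options.selected_fields.append("name")
read_options.selected_fields.append("number")
read_options.row_restriction = 'state = "WA"'

# Create a read session; the billing project is a placeholder.
session = client.create_read_session(
    table_ref,
    "projects/your-project-id",
    read_options=read_options,
    requested_streams=1,  # a single stream; more could be read in parallel
)

# Read the stream from the beginning and load it into a DataFrame
# (to_dataframe() requires pandas and fastavro).
position = bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
reader = client.read_rows(position)
df = reader.rows(session).to_dataframe()
```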

Perhaps the only catch with the new API is that it introduces new costs. While the first 300 TB read are currently offered for free, subsequent reads are billed at $1.10 per TB.

The BigQuery Storage API is still in beta, but it’s well worth a try!