Querying geographic raster data in BigQuery the brute force way

Represent each pixel by a polygon! Don’t worry. BigQuery can handle it.

A couple of times over the past week, I’ve gotten a variant of this question: “I have raster geographic data. Is there a way to transform it so that I can use BigQuery GIS?” The reason is that the person asking wants to be able to use good old SQL to join geospatial imagery against other GIS datasets.

One possibility is to take the gridded data, contour it, and store the contours in BigQuery as polygons. Then, we can join the contours against other geographic types. This seems reasonable, but I know from experience that contouring a gridded field is not straightforward. Firstly, algorithms such as Marching Squares result in boxy, unnatural contours. Trying to fix this problem by smoothing the field before contouring leads to loss of accuracy. Secondly, it is necessary to create contours at various iso-levels, and if the field in question consists of continuous values, the contour representations are lossy once again.

Is there a better way?

Store each pixel as a polygon in BigQuery

Well, BigQuery can deal with petabyte-scale datasets, so what if we did something brute force? Something that people normally shy away from because their tools can’t handle it? What if we take every pixel of the grid, and store it as a polygon?

NASA publishes a global gridded population density that looks like this:

We can download the population density data as an ASCII text file organized as 1800 rows and 3600 columns, where each pixel corresponds to 0.1 degree of latitude and longitude.

The first thing I did was to convert the downloaded text file into a JSON file where the pixels were represented by a polygon of the boundaries of the pixels:

# represent each pixel by a polygon of its corners
top = ORIGIN_LAT - rowno * LATRES
bot = ORIGIN_LAT - (rowno + 1) * LATRES
left = ORIGIN_LON + colno * LONRES
right = ORIGIN_LON + (colno + 1) * LONRES

poly = 'POLYGON(({:.2f} {:.2f}, {:.2f} {:.2f}, {:.2f} {:.2f}, {:.2f} {:.2f}, {:.2f} {:.2f}))'.format(
    left, top,    # top left
    left, bot,    # bottom left
    right, bot,   # bottom right
    right, top,   # top right
    left, top     # same as first point, to close the ring
)
center = 'POINT({:.6f} {:.6f})'.format((left + right) / 2, (top + bot) / 2)

pixel = {
    'rowno': rowno,
    'colno': colno,
    'location': center,
    'bounds': poly,
    'population_density': value
}

The resulting JSON file has lines that look like this:

{"rowno": 64, "colno": 1451, "location": "POINT(-34.850000 83.550000)", "bounds": "POLYGON((-34.90 83.60, -34.90 83.50, -34.80 83.50, -34.80 83.60, -34.90 83.60))", "population_density": 1.04}

Note that the geography types are represented in Well-Known Text (WKT) form.
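For completeness, here is a minimal sketch of the conversion loop that produces such lines. The file names, the six-line header skip, and the NODATA marker are assumptions about the downloaded ASCII file, not details taken from the actual script; the grid origin and 0.1-degree resolution match the sample record above.

import gzip
import json

ORIGIN_LAT, ORIGIN_LON = 90.0, -180.0   # top-left corner of the global grid
LATRES, LONRES = 0.1, 0.1               # 0.1-degree pixels
NODATA = -9999.0                        # assumed missing-value marker

def make_pixel(rowno, colno, value):
    # per-pixel logic from the snippet above
    top = ORIGIN_LAT - rowno * LATRES
    bot = ORIGIN_LAT - (rowno + 1) * LATRES
    left = ORIGIN_LON + colno * LONRES
    right = ORIGIN_LON + (colno + 1) * LONRES
    poly = 'POLYGON(({:.2f} {:.2f}, {:.2f} {:.2f}, {:.2f} {:.2f}, {:.2f} {:.2f}, {:.2f} {:.2f}))'.format(
        left, top, left, bot, right, bot, right, top, left, top)
    center = 'POINT({:.6f} {:.6f})'.format((left + right) / 2, (top + bot) / 2)
    return {'rowno': rowno, 'colno': colno, 'location': center,
            'bounds': poly, 'population_density': value}

with open('gpw_popdensity.asc') as infile, \
     gzip.open('popdensity_geo.json.gz', 'wt') as outfile:
    for _ in range(6):                  # skip the ASCII grid header lines
        next(infile)
    for rowno, line in enumerate(infile):
        for colno, token in enumerate(line.split()):
            value = float(token)
            if value == NODATA:
                continue                # skip ocean / missing pixels
            outfile.write(json.dumps(make_pixel(rowno, colno, value)) + '\n')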

Now, we can write up the schema:

[
  {
    "description": "row number",
    "mode": "REQUIRED",
    "name": "rowno",
    "type": "INTEGER"
  },
  {
    "description": "column number",
    "mode": "REQUIRED",
    "name": "colno",
    "type": "INTEGER"
  },
  {
    "description": "centroid location",
    "mode": "REQUIRED",
    "name": "location",
    "type": "GEOGRAPHY"
  },
  {
    "description": "polygon representing boundary of pixel",
    "mode": "REQUIRED",
    "name": "bounds",
    "type": "GEOGRAPHY"
  },
  {
    "description": "population density (persons per sq km)",
    "mode": "REQUIRED",
    "name": "population_density",
    "type": "FLOAT"
  }
]

Then, load it into BigQuery using:

bq load --source_format=NEWLINE_DELIMITED_JSON \
  advdata.popdensity popdensity_geo.json.gz schema.json
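As a quick sanity check (not strictly necessary, but cheap), we can count the rows and look at the density range; for an 1800 × 3600 grid there should be at most 6,480,000 pixels, fewer if ocean/NODATA cells were dropped during conversion:

SELECT
  COUNT(*) AS num_pixels
  , MIN(population_density) AS min_density
  , MAX(population_density) AS max_density
FROM advdata.popdensity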

So, the data are loaded in and it’s an exact representation. But can we actually query this monster efficiently?

Query Performance and Data Visualization

Let’s see … Here’s a query that joins our population density raster data against a public dataset of urban areas to find the most heavily populated urban areas in Washington state:

WITH urban_populations AS (
  SELECT
    lsad_name
    , SUM(ST_AREA(bounds)/1000000) AS area_sqkm
    , COUNT(1) AS num_pixels
    , AVG(population_density) AS pop_density
  FROM advdata.popdensity, `bigquery-public-data.geo_us_boundaries.urban_areas`
  WHERE ST_INTERSECTS(bounds, urban_area_geom)
  AND STRPOS(lsad_name, ', WA') > 0
  GROUP BY lsad_name
)
SELECT
  *, (area_sqkm * pop_density / 1000000) AS population_millions
FROM urban_populations
ORDER BY area_sqkm DESC
LIMIT 10

The query above is reasonably complicated — it computes the areas of the pixels on the fly and does intersection checks between each pixel polygon and each urban area polygon — so if this works efficiently, we ought to be in good shape.

I discovered that the query processes 500 MB and comes back in just 4 seconds:

This looks correct (Seattle has a metro population of 3.7 million in 2020, so 3.17 million in 2000 appears reasonable). The 500 MB query, using the on-demand pricing, would cost about 1/4 cent, so it’s cost-effective as well.
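(The arithmetic, at the roughly $5 per TB on-demand rate in effect at the time: 500 MB is about 0.0005 TB, and 0.0005 TB × $5/TB ≈ $0.0025, or a quarter of a cent.)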

Let’s spot check with a bit of visualization that our transformations are correct. Let’s go over to BigQuery GeoViz and query for the 1000 most densely populated grid cells in the United States:

SELECT
  location, population_density, bounds
FROM advdata.popdensity,
  `bigquery-public-data`.utility_us.us_border_area
WHERE ST_WITHIN(location, us_border_geom)
ORDER BY population_density DESC
LIMIT 1000

I specified bounds to be the geometry column to visualize and zoomed in on the Four Corners area of the Southwestern United States:

Sweet!

100x the data

The data we downloaded from NASA “has been scaled and resampled … and may not be suitable for rigorous scientific use”. Instead of using the approximately 10km resolution data that NASA publishes, what if we use 1km resolution pixels? The raw data for global population density is available from Columbia University, so I registered and downloaded the highest resolution data they have for 2020:

The 1km resolution dataset is 100x the size of the NASA dataset. Let’s see if BigQuery is up to the challenge.

Converting the data and loading it into BigQuery takes a lot longer because, well, 100x more data. The data is also now big enough that I staged it in Google Cloud Storage instead of loading it into BigQuery directly from my laptop. Of course, I could have parallelized the code that creates the JSON, but I didn’t.
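Staging is just a copy to Cloud Storage along these lines (the bucket and file names here are placeholders, not the ones actually used):

gsutil cp popdensity_1km_*.json.gz gs://MY_BUCKET/popdensity/
GCSFILE=gs://MY_BUCKET/popdensity/popdensity_1km_*.json.gz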

One change I did make was to add a column specifying the tile the pixel comes from. I then specified the tile as a clustering column in the BigQuery table:

bq load --replace \
  --source_format NEWLINE_DELIMITED_JSON \
  --time_partitioning_type DAY \
  --clustering_fields tile \
  advdata.popdensity $GCSFILE schema.json
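For that to work, schema.json gets one more entry for the tile column, something like this (the description wording is a placeholder):

{
  "description": "SEDAC source tile the pixel comes from",
  "mode": "REQUIRED",
  "name": "tile",
  "type": "STRING"
}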

Clustering helps improve query performance and reduce query costs. This is important since the 1km global grid is 52 GB! The way the SEDAC data is set up, most queries can get away with processing just one or two tiles:

For example, the entire state of Washington is in tile #1, so I can specify this in the WHERE clause:

WITH urban_populations AS (
  SELECT
    lsad_name
    , SUM(ST_AREA(bounds)/1000000) AS area_sqkm
    , COUNT(1) AS num_pixels
    , AVG(population_density) AS pop_density
  FROM advdata.popdensity, `bigquery-public-data.geo_us_boundaries.urban_areas`
  WHERE
    tile = 'gpw_v4_population_density_rev11_2020_30_sec_1.asc'
    AND ST_INTERSECTS(bounds, urban_area_geom)
    AND STRPOS(lsad_name, ', WA') > 0
  GROUP BY lsad_name
)
SELECT
  *, (area_sqkm * pop_density / 1000000) AS population_millions
FROM urban_populations
ORDER BY area_sqkm DESC
LIMIT 10

Here’s the result for the urban areas in Washington (note that the number of pixels is much larger now because we are using 1km resolution data):

The query on the 1km resolution data processes 5.8 GB (cost: 2.5c) but still takes only 6 seconds!
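That 5.8 GB is only about a ninth of the 52 GB table, which suggests the tile filter plus clustering is pruning most of the data before the spatial join even runs.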

And here’s the GeoViz — note how far in I have to zoom to get the densely populated pixels:

This completely surprised me. You can brute-force query raster GIS data in BigQuery. It’s fast, it’s cheap, and it’s accurate. Amazing!

Next steps: