Step 2: Load into BigQuery

Once we have the files on our side, loading the data CSVs is straightforward:

```shell
bq load --autodetect \
  fh-bigquery:deleting.uber_sf_censustracts_201803_all_hourly \
  san_francisco-censustracts-2018-3-All-HourlyAggregate.csv

bq load --autodetect \
  fh-bigquery:deleting.uber_sf_censustracts_201803_weekdays_hourly \
  san_francisco-censustracts-2018-3-OnlyWeekdays-HourlyAggregate.csv

bq load --autodetect \
  fh-bigquery:deleting.uber_sf_censustracts_201803_weekends_hourly \
  san_francisco-censustracts-2018-3-OnlyWeekends-HourlyAggregate.csv
```

However, the geo boundary files will pose some challenges. They are standard GeoJSON files, but we’ll have to massage them before loading into BigQuery:

1. Transform the GeoJSON file into a newline-delimited JSON file with jq.
2. Load the new .json files as CSV into BigQuery.
3. Parse the JSON rows in BigQuery to generate native GIS geometries.

```shell
jq -c .features[] \
  san_francisco_censustracts.json > sf_censustracts_201905.json

bq load --source_format=CSV \
  --quote='' --field_delimiter='|' \
  fh-bigquery:deleting.sf_censustracts_201905 \
  sf_censustracts_201905.json row
```
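If jq isn’t handy, the same FeatureCollection-to-newline-delimited-JSON transformation can be sketched in a few lines of Python (the sample features below are made up for illustration only):

```python
import json

def geojson_to_ndjson(feature_collection: dict) -> str:
    """Emit each feature of a GeoJSON FeatureCollection as one compact
    JSON line -- the same shape `jq -c .features[]` produces."""
    return "\n".join(
        json.dumps(feature, separators=(",", ":"))
        for feature in feature_collection["features"]
    )

# Tiny made-up FeatureCollection, just to show the transformation.
collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"MOVEMENT_ID": "1", "DISPLAY_NAME": "Tract A"},
         "geometry": {"type": "Point", "coordinates": [-122.4, 37.8]}},
        {"type": "Feature",
         "properties": {"MOVEMENT_ID": "2", "DISPLAY_NAME": "Tract B"},
         "geometry": {"type": "Point", "coordinates": [-122.5, 37.7]}},
    ],
}

ndjson = geojson_to_ndjson(collection)
print(ndjson)  # one feature per line, ready for the CSV-trick load above
```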

For step 3, we can parse the loaded rows inside BigQuery:

```sql
CREATE OR REPLACE TABLE `fh-bigquery.uber_201905.sf_censustracts`
AS
SELECT FORMAT('%f,%f', ST_Y(centroid), ST_X(centroid)) lat_lon, *
FROM (
  SELECT *, ST_CENTROID(geometry) centroid
  FROM (
    SELECT
      CAST(JSON_EXTRACT_SCALAR(row, '$.properties.MOVEMENT_ID') AS INT64) movement_id
      , JSON_EXTRACT_SCALAR(row, '$.properties.DISPLAY_NAME') display_name
      , ST_GeogFromGeoJson(JSON_EXTRACT(row, '$.geometry')) geometry
    FROM `fh-bigquery.deleting.sf_censustracts_201905`
  )
)
```
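To see what that JSON-parsing step does to a single loaded row, here is a rough Python sketch of the same extraction (the sample polygon is invented, and the vertex average is only a crude stand-in for BigQuery’s true geodesic ST_CENTROID):

```python
import json

def parse_row(row: str) -> dict:
    """Rough equivalent of the JSON_EXTRACT_SCALAR / centroid step,
    applied to one loaded row (a JSON string)."""
    feature = json.loads(row)
    # coordinates[0] is the polygon's outer ring of [lon, lat] pairs.
    lons, lats = zip(*feature["geometry"]["coordinates"][0])
    return {
        "movement_id": int(feature["properties"]["MOVEMENT_ID"]),
        "display_name": feature["properties"]["DISPLAY_NAME"],
        # Vertex average as a crude centroid stand-in; BigQuery's
        # ST_CENTROID computes the real geographic centroid.
        "lat_lon": "%f,%f" % (sum(lats) / len(lats), sum(lons) / len(lons)),
    }

# Made-up row, shaped like one line of the loaded NDJSON file.
row = json.dumps({
    "properties": {"MOVEMENT_ID": "42", "DISPLAY_NAME": "Sample Tract"},
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-122.4, 37.8], [-122.4, 37.9],
                                  [-122.3, 37.9], [-122.3, 37.8]]]},
})
print(parse_row(row))
```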

You can find alternative ways to load GeoJSON data (JavaScript, ogr2ogr) from Lak Lakshmanan and Michael Entin in this Stack Overflow reply.

Step 3: Massage the data for performance and efficiency

Now that we have both tables inside BigQuery, let’s massage the data to partition the main table, create native BQ GIS geometry columns, and join everything together:

Let’s create our main table. It will contain weekday, weekend, and overall stats — and we’ll add some census tract data to make it easier to visualize and understand. For efficiency we’ll partition it by quarter and cluster by the type of stat and the travel starting place.

I’ll also calculate some additional stats: the average distance between areas, and the speed given that distance. Note that computing these stats makes the query run considerably slower than if we skipped them:

```sql
CREATE OR REPLACE TABLE `fh-bigquery.uber_201905.sf_hourly`
PARTITION BY quarter
CLUSTER BY table_source, source, destination
-- don't partition/cluster if using BQ w/o credit card
AS
SELECT *, distance * 0.000621371192 / geometric_mean_travel_time * 3600 speed_mph
FROM (
  SELECT * EXCEPT(geo_b, geo_c)
    , (ST_DISTANCE(geo_b, geo_c) + ST_MAXDISTANCE(geo_b, geo_c)) / 2 distance
  FROM (
    SELECT a.*, SPLIT(_TABLE_SUFFIX, '_')[OFFSET(0)] table_source
      , b.display_name source
      , c.display_name destination
      , b.lat_lon sourceid_lat_lon
      , CAST(SPLIT(b.lat_lon, ',')[OFFSET(0)] AS FLOAT64) sourceid_lat
      , CAST(SPLIT(b.lat_lon, ',')[OFFSET(1)] AS FLOAT64) sourceid_lon
      , c.lat_lon dstid_lat_lon
      , CAST(SPLIT(c.lat_lon, ',')[OFFSET(0)] AS FLOAT64) dstid_lat
      , CAST(SPLIT(c.lat_lon, ',')[OFFSET(1)] AS FLOAT64) dstid_lon
      , DATE('2018-07-01') quarter
      , COUNT(*) OVER(PARTITION BY sourceid, dstid) source_dst_popularity
      , COUNT(*) OVER(PARTITION BY dstid) dst_popularity
      , b.geometry geo_b, c.geometry geo_c
    FROM `fh-bigquery.deleting.uber_sf_censustracts_201803_*` a
    JOIN `fh-bigquery.uber_201905.sf_censustracts` b
    ON a.sourceid = b.movement_id
    JOIN `fh-bigquery.uber_201905.sf_censustracts` c
    ON a.dstid = c.movement_id
  )
)
```
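The derived columns boil down to simple arithmetic: speed converts meters per second into miles per hour, and the between-area distance is the midpoint of the minimum and maximum distances between the two geometries. A minimal Python sketch of the same formulas:

```python
def speed_mph(distance_m: float, travel_time_s: float) -> float:
    """Same arithmetic as the speed_mph column:
    meters -> miles (x 0.000621371192), seconds -> hours (x 3600)."""
    return distance_m * 0.000621371192 / travel_time_s * 3600

def estimated_distance(st_distance_m: float, st_maxdistance_m: float) -> float:
    """Midpoint of the minimum (ST_DISTANCE) and maximum
    (ST_MAXDISTANCE) distance between two areas, in meters."""
    return (st_distance_m + st_maxdistance_m) / 2

# 1609.344 m is one mile, so covering it in 120 s is 30 mph.
print(round(speed_mph(1609.344, 120), 3))  # -> 30.0
```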

Step 4: Query and visualize

I created an interactive dashboard with Data Studio. Note that it runs super fast thanks to our new BigQuery BI Engine. But you can also run your own queries!

For example, the worst destinations in San Francisco, and the time of day when arriving at the SFO airport: