While stuck in busy morning traffic in Sydney, I wondered whether I could bypass the congestion by changing my travel time. If so, what are the ideal time ranges? Is there any pattern in the hours when congestion starts building up? I can always check Google Maps for the current traffic conditions, but is there a consistent daily pattern of traffic buildup? Luckily, NSW Roads and Maritime Services (RMS) has live cameras, accessible to the public, that provide fresh images every minute. The list of all cameras can be found here — http://m.livetraffic.rta.nsw.gov.au/CameraList.aspx.

I was aware of both Amazon Rekognition and the Google Cloud Vision API and the cool things they can do. I fed one of the RMS images to both the AWS and GCP APIs and was blown away by how well they detect traffic congestion. One interesting feature I found with the GCP Cloud Vision API, though, is that its confidence score tracks the intensity of the congestion quite well. So I thought: how about using this API to collect traffic congestion data and analyze it for patterns?

To do so, I needed to collect the images from the camera at regular intervals, feed them into the API to get the score, and store that data for later analysis. Luckily, GCP has an answer to that, and so does AWS. But I chose GCP to keep the pipeline completely serverless. Having BigQuery in the ecosystem to store data on the fly, always available and with no cluster to spin up, was a big plus for GCP. Furthermore, the JSON output generated by the Vision API was easy to analyze in BigQuery as well. So let’s start building the pipeline.

The pipeline

To download the camera images and process them using the Vision API, I used “Google Cloud Functions”, the serverless way to run code in GCP (equivalent to AWS Lambda). My cloud function invokes the Vision API every 5 minutes and stores the complete JSON response in BigQuery, along with the date and time the data was inserted. This timestamp is used later to analyze the traffic over time. “Google Cloud Scheduler” performs the duty of invoking the Cloud Function every 5 minutes by passing a dummy message to a pre-configured Google Cloud Pub/Sub topic, which securely triggers the Cloud Function. Overall, the pipeline looks like below.

Pipeline Components:

Cloud Vision API

Cloud Vision API (https://cloud.google.com/vision/) extracts useful insights from images using pre-trained models. It can “Assign labels to images and quickly classify them into millions of predefined categories. Detect objects and faces, read printed and handwritten text, and build valuable metadata into your image catalog.” For example, the Vision API attached the following labels to the image below, with the confidence scores shown.
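The label annotations the API returns are essentially (description, score) pairs. As an illustration only, here is a sketch of flattening and filtering such labels by confidence, using mocked annotations in place of a live response (a real call needs GCP credentials; the labels and scores below are made up, not real API output):

```python
# Mocked label annotations standing in for a Vision API response.
# The descriptions and scores are illustrative only.
mock_labels = [
    {"description": "Traffic congestion", "score": 0.92},
    {"description": "Road", "score": 0.88},
    {"description": "Lane", "score": 0.74},
]

def flatten_labels(labels, min_score=0.0):
    """Return (description, score) pairs at or above a confidence threshold."""
    return [(lab["description"], lab["score"])
            for lab in labels if lab["score"] >= min_score]

print(flatten_labels(mock_labels, min_score=0.8))
# → [('Traffic congestion', 0.92), ('Road', 0.88)]
```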

Cloud Function

Google Cloud Functions is the serverless way to execute code. Currently you can code in Python, Node.js, and Go. It does not support as many programming languages as AWS Lambda does, but more are gradually being added. For this experiment, I used the Python 3.7 runtime.

The inline source code editor gives you two tabs: main.py and requirements.txt. main.py contains the actual function code. Any third-party library dependencies need to be listed in requirements.txt.

The easiest way to trigger a cloud function is via an HTTP trigger, but it’s also the least secure. Anybody who knows your Cloud Function URL can invoke the HTTP endpoint, triggering the function and potentially burning through the free Vision API quota of 1,000 calls per month. That’s why I chose the Pub/Sub trigger. Using Cloud Scheduler, I published a message to Pub/Sub every five minutes, which triggered the cloud function.
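Before wiring up the scheduler, the Pub/Sub trigger can be exercised by hand by publishing a message to the topic. A sketch, assuming a topic named traffic-congestion-topic (a placeholder name, not from the original setup):

```shell
# Publish a dummy message to the topic; any Cloud Function subscribed
# to it will be triggered.
gcloud pubsub topics publish traffic-congestion-topic --message="test"
```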

The complete Python code is below.

import base64
import datetime

from google.cloud import bigquery
from google.cloud import vision


def cloudfn_pubsub_detect_congestion(event, context):
    """Triggered from a message on a Cloud Pub/Sub topic.

    Detects labels in the camera image and writes the output into BigQuery.

    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')
    print(pubsub_message)

    uri = 'http://www.rms.nsw.gov.au/trafficreports/cameras/camera_images/humehwy_bankstown.jpg?1549526684549'
    bq_dataset = 'POC'
    bq_table = 'test'

    # Run label detection on the remote camera image
    vision_client = vision.ImageAnnotatorClient()
    image = vision.types.Image()
    image.source.image_uri = uri
    response = vision_client.label_detection(image=image)
    labels = response.label_annotations

    # Collect (description, score) pairs for every detected label
    label_data = [(label.description, label.score) for label in labels]

    # strftime converts to a timestamp format accepted by the BigQuery API
    rows_to_insert = [
        (datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%fZ'), label_data)
    ]

    bq_client = bigquery.Client()
    table_ref = bq_client.dataset(bq_dataset).table(bq_table)  # reference to the table
    table = bq_client.get_table(table_ref)  # API call
    errors = bq_client.insert_rows(table, rows_to_insert)  # API request
    if not errors:
        print('Successfully loaded data')
    else:
        print('Errors:')
        for error in errors:
            print(error)

The requirements.txt file should have the following contents, declaring the dependencies on the Cloud Vision and BigQuery client libraries.

# Function dependencies, for example:
# package>=version
google-cloud-bigquery
google-cloud-vision

Cloud Scheduler

Cloud Scheduler is a fully managed, enterprise-grade cron job scheduler. You can define a schedule using the Unix cron format so that your job runs multiple times a day, or on specific days or months of the year. I created a Cloud Scheduler job to publish a message to Pub/Sub every 5 minutes between 6 AM and 6 PM for a week, using the configuration below:
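As a sketch, a schedule like this can be expressed in Unix cron format as `*/5 6-18 * * *` (every 5 minutes during the 6 AM through 6 PM hours), and the job created from the command line. The job and topic names below are placeholders, not the original configuration:

```shell
# Create a Pub/Sub-targeted scheduler job firing every 5 minutes
# during the 6 AM - 6 PM hours (names are illustrative).
gcloud scheduler jobs create pubsub traffic-congestion-job \
    --schedule="*/5 6-18 * * *" \
    --topic=traffic-congestion-topic \
    --message-body="collect" \
    --time-zone="Australia/Sydney"
```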

Visualization using Tableau

I wanted to see the average trend across the week, and also by weekday, to check whether a particular day is worse than the others. I really wanted to do everything as a GCP managed service, so for data visualization I intended to use Data Studio to finish off this analysis. But I had real trouble plotting the 5-minute-interval data that I had collected: it looks like the lowest granularity you can reach on a time series is hourly, and Data Studio aggregates the data at the hourly level. After some failed attempts, I gave up and switched to good old Tableau.

While looking at the data stored in BigQuery, I found that some of the observations don’t have any labels at all. There was no error message in the logs to tell whether this was caused by a timeout fetching the image from the URL or by a problem with the Vision API. I used the custom query below, built on BigQuery’s UNNEST function, to keep only the “Traffic congestion” labels and their corresponding scores. Plotting the raw data gave me the visual below:

select datetime(inserttimestamp, 'Australia/Sydney') as dttm, score
from (
  SELECT a.insertTimestamp, b.score
  FROM `august-button-203520.myPOC.nsw_traffic_vision` a
  inner join unnest(a.labels) b
  where b.name = 'Traffic congestion'
) a
where datetime(inserttimestamp, 'Australia/Sydney')
  between DATETIME '2019-02-11 00:00:00.00' and DATETIME '2019-02-18 00:00:00.00'

Because the “Traffic congestion” label is present at different timestamps on different weekdays, the chart came out non-continuous. So I filled in the missing timestamps, where no traffic congestion was recorded, with 0, using the query below:

select datetime(inserttimestamp, 'Australia/Sydney') as dttm, score
from (
  SELECT inserttimestamp, 0 as score
  FROM `august-button-203520.myPOC.nsw_traffic_vision`
  where inserttimestamp not in (
    SELECT a.insertTimestamp
    FROM `august-button-203520.myPOC.nsw_traffic_vision` a
    inner join unnest(a.labels) b
    where b.name = 'Traffic congestion'
  )
  union all
  SELECT a.insertTimestamp, b.score
  FROM `august-button-203520.myPOC.nsw_traffic_vision` a
  inner join unnest(a.labels) b
  where b.name = 'Traffic congestion'
) a
where datetime(inserttimestamp, 'Australia/Sydney')
  between DATETIME '2019-02-11 00:00:00.00' and DATETIME '2019-02-18 00:00:00.00'

Changing the score calculation to a moving average over the last 3 observations, the chart looked like below:
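Tableau computes this smoothing with a table calculation; purely as an illustration, the same trailing 3-point moving average can be sketched in plain Python:

```python
def moving_average(scores, window=3):
    """Trailing moving average: each point averages up to the last `window` values."""
    out = []
    for i, _ in enumerate(scores):
        chunk = scores[max(0, i - window + 1): i + 1]  # last `window` points so far
        out.append(sum(chunk) / len(chunk))
    return out

# Each output point averages up to the last 3 scores.
print(moving_average([1.0, 2.0, 3.0, 4.0]))  # → [1.0, 1.5, 2.0, 3.0]
```

At the start of the series there are fewer than 3 points, so the window simply shrinks rather than dropping those observations.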

It’s still quite a noisy chart, but it gives the following information, which is probably no surprise to daily commuters:

· Tuesday morning traffic between 7:20 AM and 8:50 AM was comparatively heavier than on the other workdays.

· The Tuesday midday traffic spike looks interesting. Afternoon traffic starts to pick up from 3 PM (aligned with school hours) and lasts till 6 PM. Thursday afternoons are quite chaotic, probably because Thursday is late-night shopping night.

· Saturday 12 PM to 2 PM is quite busy, and again around 5 PM, which is notable considering it’s the weekend.