Traffic

Traffic represents how many requests are being serviced over a given time. A common way to measure traffic is requests/second. I chose to build 3 different charts for the 3 technologies in the data processing pipeline architecture (Pub/Sub, Cloud Dataflow and BigQuery) because the y-axis scales turned out to be orders of magnitude apart for each metric, which would have made a combined chart hard to read. You may choose to include them on a single chart for simplicity.

Dataflow traffic chart

Stackdriver Monitoring provides many different metrics for Cloud Dataflow, which you can find in the metrics documentation. Broadly, they are categorized into overall job metrics, such as job/status or job/total_vcpu_time, and processing metrics, such as job/element_count and job/estimated_byte_count.

Since we’re looking to monitor the traffic through Cloud Dataflow, the job/element_count metric, which represents “The number of elements added to the pcollection so far”, aligns well with measuring the amount of traffic. Importantly, the metric increases as the volume of traffic increases. Thus, it’s a reasonable metric to use to understand the traffic coming into a pipeline.
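If you prefer to capture a chart like this in configuration rather than build it by hand, it can be sketched as a widget in the Monitoring dashboards API JSON format. The fragment below is an illustrative sketch, not the exact chart from my dashboard: the title and alignment period are my own choices, and I assume job/element_count is reported as a gauge (check the metric kind in the metrics documentation), so ALIGN_MEAN is used as the per-series aligner.

```json
{
  "title": "Dataflow traffic (job/element_count)",
  "xyChart": {
    "dataSets": [
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"dataflow.googleapis.com/job/element_count\" resource.type=\"dataflow_job\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_MEAN"
            }
          }
        }
      }
    ],
    "yAxis": { "label": "elements", "scale": "LINEAR" }
  }
}
```

With no cross-series reducer, each Dataflow job produces its own line, which is usually what you want when watching traffic per pipeline.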

The screenshot below captures the Cloud Dataflow traffic chart in the dashboard.

Pub/Sub traffic chart

Stackdriver Monitoring metrics for Pub/Sub are categorized into topic, subscription and snapshot metrics. Both subscription and topic metrics may be used to chart traffic, since they represent the two sides of a message published to Pub/Sub.

Since I want to see the amount of incoming traffic, looking at the metrics around the inbound topics that receive the data is a reasonable choice. Specifically, the topic/send_request_count metric, which represents the “Cumulative count of publish requests, grouped by result”, aligns well with measuring the amount of traffic.
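Sketched as a dashboards API widget (again, the title and alignment period are illustrative choices of mine): because topic/send_request_count is a delta counter, aligning each series with ALIGN_RATE converts the raw counts into requests per second, which matches the requests/second definition of traffic above.

```json
{
  "title": "Pub/Sub traffic (topic/send_request_count)",
  "xyChart": {
    "dataSets": [
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"pubsub.googleapis.com/topic/send_request_count\" resource.type=\"pubsub_topic\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_RATE"
            }
          }
        }
      }
    ],
    "yAxis": { "label": "requests/s", "scale": "LINEAR" }
  }
}
```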

The screenshot below captures the Pub/Sub traffic chart in the dashboard.

BigQuery traffic chart

Stackdriver Monitoring metrics for BigQuery are categorized into bigquery_project, bigquery_dataset and query metrics.

Since I would like to see the amount of incoming traffic, looking at the metrics related to uploaded data is a reasonable choice. Specifically, the storage/uploaded_bytes metric aligns well with measuring incoming traffic to BigQuery.
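A corresponding widget sketch in the dashboards API JSON format might look like the following. The title and alignment period are illustrative, and I deliberately omit a resource.type clause and rely on the metric type alone; since uploaded_bytes is a delta counter, ALIGN_RATE renders it as bytes per second.

```json
{
  "title": "BigQuery traffic (storage/uploaded_bytes)",
  "xyChart": {
    "dataSets": [
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"bigquery.googleapis.com/storage/uploaded_bytes\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_RATE"
            }
          }
        }
      }
    ],
    "yAxis": { "label": "bytes/s", "scale": "LINEAR" }
  }
}
```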

The screenshot below captures the BigQuery traffic chart in the dashboard.

Latency

Latency represents how long it takes to service a request over a given time. A common way to measure latency is the time required to service a request, in seconds. In this sample architecture with Pub/Sub, BigQuery and Cloud Dataflow, the metrics that may be useful for understanding latency indicate how long it takes for data to move through the Cloud Dataflow pipeline (or through individual steps within it), how long a message remains unacknowledged in Pub/Sub, and how long it takes to insert records into BigQuery.

System lag chart

Since I’d like to see the amount of time that it takes to service requests, looking at the metrics related to processing time and lag are reasonable choices. Specifically, the job/data_watermark_age metric, which represents “The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline”, and the job/system_lag metric, which represents “The current maximum duration that an item of data has been awaiting processing, in seconds”, align well with measuring the time taken to be processed through the Cloud Dataflow pipeline.
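Because both metrics are measured in seconds, they can share one chart. As a sketch in the dashboards API JSON format (the title and alignment settings are my own; both metrics are gauges, and ALIGN_MAX keeps the worst case within each alignment window visible):

```json
{
  "title": "Dataflow latency (watermark age and system lag)",
  "xyChart": {
    "dataSets": [
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"dataflow.googleapis.com/job/data_watermark_age\" resource.type=\"dataflow_job\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_MAX"
            }
          }
        }
      },
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"dataflow.googleapis.com/job/system_lag\" resource.type=\"dataflow_job\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_MAX"
            }
          }
        }
      }
    ],
    "yAxis": { "label": "seconds", "scale": "LINEAR" }
  }
}
```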

The screenshot below captures the Cloud Dataflow system lag chart in the dashboard.

Saturation

Saturation represents how utilized the resources that run your service are. Saturation is meant to monitor a metric that shows when the system may begin to be constrained. In this sample architecture with Pub/Sub, BigQuery and Cloud Dataflow, the metrics that may be useful to understand saturation are the age of the oldest unacknowledged messages (if processing slows down, then messages will remain in Pub/Sub longer) and, in Cloud Dataflow, the watermark age of the data (if processing slows down, then messages will take longer to get through the pipeline).

Saturation chart

Since I’d like to see when the service is approaching provisioned capacity, one assumption that I can make is that the time to process a given message will increase as the system approaches full utilization of its resources. This may not always be the case for data processing pipelines in general, but since I am processing asynchronously with Pub/Sub and Cloud Dataflow, this assumption is a reasonable one.

Specifically, the job/data_watermark_age metric, which we used above, and the topic/oldest_unacked_message_age_by_region metric, which represents the “Age (in seconds) of the oldest unacknowledged message in a topic”, align well with measuring increases in Cloud Dataflow processing time and in the time for the pipeline to receive and acknowledge input messages from Pub/Sub.
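Since both of these metrics are gauges expressed in seconds, they can be sketched as two data sets in a single dashboards API widget. The fragment below is illustrative (title and alignment settings are mine); ALIGN_MAX surfaces the oldest/slowest item in each window, which is exactly the signal of a saturating system:

```json
{
  "title": "Saturation (watermark age and oldest unacked message age)",
  "xyChart": {
    "dataSets": [
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"dataflow.googleapis.com/job/data_watermark_age\" resource.type=\"dataflow_job\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_MAX"
            }
          }
        }
      },
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"pubsub.googleapis.com/topic/oldest_unacked_message_age_by_region\" resource.type=\"pubsub_topic\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_MAX"
            }
          }
        }
      }
    ],
    "yAxis": { "label": "seconds", "scale": "LINEAR" }
  }
}
```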

The screenshot below captures the Saturation chart for Pub/Sub and Cloud Dataflow in the dashboard.

Errors

Errors represent application errors, infrastructure errors or failure rates. The point here is to monitor a metric that shows an increased error rate when errors are encountered. In this sample architecture with Pub/Sub, BigQuery and Cloud Dataflow, the metrics that may be useful to understand errors are the errors reported in the logs for Pub/Sub, Cloud Dataflow and BigQuery.

Data processing pipeline errors chart

Since I’d like to see the error rate for the service, I can look at the errors that are reported in the logs for the services included in the architecture.

Specifically, the log_entry_count metric, which represents the “Number of log entries”, charted separately for each of the 3 services, aligns well with measuring increases in the number of errors.
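One way this might look as a dashboards API widget (a sketch; the severity label filter and the grouping by resource type are my assumptions about how to split the error counts per service): filtering log_entry_count to ERROR-severity entries and aligning with ALIGN_RATE yields one error-rate line per resource type.

```json
{
  "title": "Errors (log_entry_count, severity ERROR)",
  "xyChart": {
    "dataSets": [
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"logging.googleapis.com/log_entry_count\" metric.labels.severity=\"ERROR\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_RATE",
              "crossSeriesReducer": "REDUCE_SUM",
              "groupByFields": ["resource.type"]
            }
          }
        }
      }
    ],
    "yAxis": { "label": "errors/s", "scale": "LINEAR" }
  }
}
```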

The screenshot below captures the Errors chart for Pub/Sub, Cloud Dataflow and BigQuery in the dashboard.

Using the dashboard

In this post, I have described an approach for selecting metrics for a data processing pipeline based on the “Four Golden Signals”. You can easily build this dashboard yourself by hand in the Dashboards section of the Stackdriver Monitoring console. However, an even better approach is to use a dashboard template. Read part 2 of this series to learn how to deploy this dashboard from a JSON template.