Tracing

Internally, Ravelin is built as disparate services, each usually responsible for a single domain, that communicate with one another. Tracing requests between these services (in the vein of Dapper or Zipkin) is invaluable for reasoning about the performance of a distributed system. We sample a small percentage of our traffic to be traced throughout the system. If we choose to trace a given request, all RPC calls, bodies and errors are captured, along with the activity of any asynchronous consumers of messages produced on a queue. Doing this gives us full insight into the internal timings of requests, and allows us to map the entire system dynamically.
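
To make that concrete, here is a minimal sketch of what the write path for a sampled span could look like using the official Go client (cloud.google.com/go/bigquery). The dataset, table and field names are illustrative assumptions rather than our actual schema.

```go
// A minimal sketch of streaming sampled trace spans into BigQuery.
// Table and field names are illustrative, not our real schema.
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/bigquery"
)

// Span is one timed unit of work within a traced request.
type Span struct {
	TraceID    string    `bigquery:"trace_id"`
	SpanID     string    `bigquery:"span_id"`
	ParentID   string    `bigquery:"parent_id"`
	Service    string    `bigquery:"service"`
	Method     string    `bigquery:"method"`
	StartTime  time.Time `bigquery:"start_time"`
	DurationMs float64   `bigquery:"duration_ms"`
	Error      string    `bigquery:"error"`
	Body       string    `bigquery:"body"` // raw request/response body as JSON
}

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ins := client.Dataset("tracing").Table("spans").Inserter()
	span := Span{
		TraceID: "abc123", SpanID: "def456", Service: "checkout",
		Method: "POST /v2/transaction", StartTime: time.Now(), DurationMs: 12.4,
	}
	// Streaming insert: rows become queryable within seconds of being written.
	if err := ins.Put(ctx, []*Span{&span}); err != nil {
		log.Fatal(err)
	}
}
```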

Furthermore, this allows us to perform both ad hoc queries and detailed latency analysis over hundreds of millions of requests in a way that would be very challenging or impossible with a traditional time series database. This small percentage of traffic still results in hundreds of millions of rows a month, a quantity that BigQuery handles with ease. Storing the data in raw form means that more adventurous types could build models or simulations to back out the sensitivity of the entire system to changes in a single component's latency, or to overall load.
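
As a rough example of that kind of analysis, the sketch below pulls approximate latency percentiles per service from a sampled-span table. Again, the dataset, table and column names are made up for illustration.

```go
// A sketch of the latency analysis the trace table makes cheap:
// approximate per-service latency percentiles over a month of sampled spans.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	q := client.Query(`
		SELECT
		  service,
		  APPROX_QUANTILES(duration_ms, 100)[OFFSET(50)] AS p50_ms,
		  APPROX_QUANTILES(duration_ms, 100)[OFFSET(99)] AS p99_ms,
		  COUNT(*) AS spans
		FROM tracing.spans
		WHERE start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
		GROUP BY service
		ORDER BY p99_ms DESC`)

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row struct {
			Service string
			P50Ms   float64 `bigquery:"p50_ms"`
			P99Ms   float64 `bigquery:"p99_ms"`
			Spans   int64
		}
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s\tp50=%.1fms\tp99=%.1fms\t(%d spans)\n",
			row.Service, row.P50Ms, row.P99Ms, row.Spans)
	}
}
```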

API logging

We currently store all raw event requests to our API, along with their headers and status codes. Doing so allows us to provide clients with a log while they integrate with us, so they can inspect the bodies of the requests they make and debug any issues flagged by our API schema validation. It also allows us to understand what's coming through our API and to debug any issues on the fly. Our job would be much harder and slower if this data were just stored on S3 in log files, as it was previously.
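
A sketch of the sort of query that backs this: parameterised filtering so a single client's recent failed requests, raw bodies included, can be pulled straight out. The table and column names are hypothetical stand-ins, not our real schema.

```go
// A sketch of the integration-debugging query behind a client-facing request log.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	q := client.Query(`
		SELECT received_at, path, status_code, validation_error, body
		FROM api.requests
		WHERE client_id = @client
		  AND status_code >= 400
		  AND received_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
		ORDER BY received_at DESC
		LIMIT 50`)
	q.Parameters = []bigquery.QueryParameter{{Name: "client", Value: "acme-co"}}

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row)
	}
}
```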

Performance analysis and ad-hoc investigation

A crucial part of building a machine learning system is performance evaluation. BigQuery is used to calculate our live classifier metrics with judicious use of quick joins between tables, allowing our machine learning team to understand model performance as labels flow in from our clients. It allows us to quickly diagnose any discrepancies between offline training and online prediction, and to compare multiple models against each other. The same data is also used to build a reporting dashboard within our product to show general business metrics, and shine a light on our performance. Although the queries only take a few seconds to execute, Redis sits in front as a cache for latency reasons.
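
The shape of those metric queries is roughly as below: predictions joined to the labels that arrive later, rolled up per model version. The tables, columns and the 0.5 threshold are illustrative assumptions rather than our production definitions.

```go
// A sketch of computing live classifier metrics by joining predictions to labels.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

const metricsSQL = `
SELECT
  p.model_version,
  COUNTIF(p.score >= 0.5 AND l.is_fraud)     AS true_positives,
  COUNTIF(p.score >= 0.5 AND NOT l.is_fraud) AS false_positives,
  COUNTIF(p.score <  0.5 AND l.is_fraud)     AS false_negatives
FROM ml.predictions AS p
JOIN ml.labels AS l
  ON p.event_id = l.event_id
WHERE p.predicted_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY p.model_version`

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	it, err := client.Query(metricsSQL).Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		// Guarding against zero denominators is omitted for brevity.
		model, tp, fp, fn := row[0], row[1].(int64), row[2].(int64), row[3].(int64)
		precision := float64(tp) / float64(tp+fp)
		recall := float64(tp) / float64(tp+fn)
		fmt.Printf("%v precision=%.3f recall=%.3f\n", model, precision, recall)
	}
}
```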

BigQuery also powers any ad-hoc investigation that we'd like to do. These range from building heat maps of fraud in specific areas to visualising the effects of data breaches on different card issuers around the world.

What should I use it for?

Schema’d, immutable data — ‘a well defined, singular thing happened at time T’. This data plays well with BigQuery’s append only model, and can be sliced every which way extremely quickly.
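
For example, such an event might be laid out as below: a schema inferred from a Go struct and a day-partitioned, append-only table created to hold it. The struct and table names are invented for the sketch.

```go
// A sketch of an append-only, schema'd event table: infer the BigQuery schema
// from a Go struct and create a day-partitioned table for it.
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/bigquery"
)

// OrderPlaced is "a well defined, singular thing that happened at time T".
type OrderPlaced struct {
	OrderID     string    `bigquery:"order_id"`
	CustomerID  string    `bigquery:"customer_id"`
	AmountPence int64     `bigquery:"amount_pence"`
	Currency    string    `bigquery:"currency"`
	OccurredAt  time.Time `bigquery:"occurred_at"`
}

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	schema, err := bigquery.InferSchema(OrderPlaced{})
	if err != nil {
		log.Fatal(err)
	}
	// Partitioning by day keeps slicing cheap: queries that filter on
	// occurred_at only scan the partitions they touch.
	meta := &bigquery.TableMetadata{
		Schema:           schema,
		TimePartitioning: &bigquery.TimePartitioning{Field: "occurred_at"},
	}
	if err := client.Dataset("events").Table("order_placed").Create(ctx, meta); err != nil {
		log.Fatal(err)
	}
}
```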

Low-level system logging and tracing. Using BigQuery allows us to collect far more data than we'd be willing to manage in something like Elasticsearch, and far more cheaply. You may have never thought about this use case before — we hadn't — but you should. Get your data out of log files and put it in BigQuery. Full text search and the ability to query over raw JSON are invaluable for debugging issues.
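
A rough illustration of that style of debugging query, combining regular-expression search over raw bodies with JSON field extraction. The table, columns and JSON path are hypothetical.

```go
// A sketch of searching raw JSON log bodies directly in BigQuery instead of
// grepping log files: REGEXP_CONTAINS for full-text matching plus
// JSON_EXTRACT_SCALAR for pulling fields out of the body.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

const searchSQL = `
SELECT
  received_at,
  JSON_EXTRACT_SCALAR(body, '$.customer.customerId') AS customer_id,
  status_code
FROM api.requests
WHERE REGEXP_CONTAINS(body, r'card_declined')
  AND received_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY received_at DESC
LIMIT 100`

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	it, err := client.Query(searchSQL).Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row)
	}
}
```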

Map-reduce-esque processing of data. We've written a UDF that geohashes our GPS data for later processing. Our record was 100 million locations processed in 20 seconds. If your problem can be explained as a map function (e.g. user agent string parsing), and it doesn't need to call out to any external services, then BigQuery is probably the fastest solution on the market.
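
To show the shape of this (not our actual UDF), here is a sketch of a JavaScript UDF applied row by row inside a query. A real geohash implementation is longer, so the stand-in below just snaps coordinates onto a coarse grid; the dataset and column names are also invented.

```go
// A sketch of map-function style processing in BigQuery: a temporary
// JavaScript UDF applied to every row of a GPS table, then aggregated.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

const geoSQL = `
CREATE TEMP FUNCTION grid_cell(lat FLOAT64, lng FLOAT64)
RETURNS STRING
LANGUAGE js AS """
  // Stand-in for geohashing: snap to a coarse grid by rounding to 2 d.p.
  return lat.toFixed(2) + ',' + lng.toFixed(2);
""";

SELECT grid_cell(lat, lng) AS cell, COUNT(*) AS points
FROM telemetry.gps_points
GROUP BY cell
ORDER BY points DESC
LIMIT 20`

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	it, err := client.Query(geoSQL).Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row)
	}
}
```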

Being quiet, reliable and not acting like a prima donna. You don’t even have to think about it from an infrastructure reliability point of view.

What’s harder to do with it?