Portability

When writing a Beam pipeline you have to choose an SDK, the language you write your pipeline in (Java, Python or Go), and a runner, the engine that executes your code (Dataflow, Flink, Spark, etc.). Each runner and SDK currently supports a different set of features, because every SDK-runner combination requires a non-trivial implementation in the framework on both sides. Runners were originally written in Java, so supporting other SDKs takes a lot of extra work. When a new feature is introduced in Java, it has to be implemented again in Python and Go before those SDKs support it.

The portability framework aims to rectify this situation and provide full interoperability between SDKs and runners across the Beam ecosystem. It introduces well-defined, language-neutral data structures and protocols between the SDK and the runner. This interop layer, called the portability API, ensures that SDKs and runners can work with each other uniformly, reducing the interoperability burden to a constant effort. Notably, it ensures that new SDKs automatically work with existing runners and vice versa.

Portability is still a work in progress. The Flink runner, for example, currently comes in two flavors: a legacy runner, which only supports Java, and a portable runner, which supports Java, Python and Go. The long-term goal is to replace the legacy runner with the portable runner. The keynote also showed an example of SDK portability: cross-language pipelines, which make it possible to use a Java IO connector directly from within a Python pipeline. The demo used the Kafka IO, which is currently not supported in Python.

I think portability is one of the coolest things to look forward to in the Beam ecosystem. As a Python developer, I am missing out on a lot of features that are only supported in Java. Cross-language pipelines would introduce a big change in this.

Cross-language pipelines: example of how to use Java code from within a Python pipeline.

TensorFlow Extended

TensorFlow Extended (TFX) is an open source, end-to-end platform for deploying production ML pipelines. Pipeline processing is a core requirement of any production ML platform.

Many TFX components use Beam to run their tasks: data ingestion, data validation, data transformation and model analysis. Beam enables a high degree of scalability across compute clusters. TFX uses the Beam Python API and supports the runners that the Python API supports. Beam provides an abstraction layer that lets TFX run on any supported runner without code modifications.

An overview of the components of TFX, and which of them use Beam in the backend.

At ML6 we already use TFX because TensorFlow Serving provides a powerful framework for deploying our ML models and for data preprocessing. Check out our blog posts below:

Time Series

Apache Beam provides different constructs to handle streaming data, including windowing, out-of-order data and late data. One scenario that was not supported, however, is missing data. For example, when processing a time series you may want to indicate a gap in the data: no event is present in a specific window, but you still want to output a value for it.

There are two ways to handle this. The first one is Looping Timers, currently available on the direct runner. This solution creates a timer that emits a “null” event every x seconds for every key in your PCollection. It uses global windowing and state in your pipeline.
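To make the idea concrete, here is a minimal plain-Python simulation of the output a looping timer is meant to produce. It is not Beam code and does not use Beam's state or timer APIs; the function name `fill_gaps`, the sensor keys and the integer timestamps are all made up for illustration. It assumes the timer for a key starts at the interval of that key's first event and then fires once per interval until `end`.

```python
from collections import defaultdict


def fill_gaps(events, interval, end):
    """Count events per (key, interval) and emit 0 for empty intervals.

    events: iterable of (key, timestamp_seconds) pairs, in any order.
    Returns {key: [(interval_start, count), ...]}, starting at the interval
    of the key's first event and running up to (not including) `end`.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for key, ts in events:
        # Align the timestamp to the start of its interval.
        bucket = (ts // interval) * interval
        counts[key][bucket] += 1

    result = {}
    for key, buckets in counts.items():
        first = min(buckets)
        # The simulated timer fires once per interval for this key,
        # whether or not any data arrived in that interval.
        result[key] = [(t, buckets.get(t, 0)) for t in range(first, end, interval)]
    return result


events = [("sensor-a", 3), ("sensor-a", 65), ("sensor-b", 130), ("sensor-a", 7)]
filled = fill_gaps(events, interval=60, end=180)
print(filled["sensor-a"])  # [(0, 2), (60, 1), (120, 0)]
print(filled["sensor-b"])  # [(120, 1)]
```

The `(120, 0)` entry for `sensor-a` is the interesting one: no event arrived in that interval, yet a value is still emitted.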

The second one is Validity Windows, which is currently still theoretical. Validity Windows slice time into windows in which a certain value is valid. An example is converting euros to yen: one stream carries the euro amounts to convert while another stream updates the conversion rate. Each conversion rate is valid in a time window until the next one arrives. Because values can arrive out of order, windows need to be split as conversion rates arrive. Since Beam does not support shrinking windows, this is not yet implemented.
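The window-splitting behaviour can be sketched in plain Python. This is only a batch illustration of the concept, not something Beam offers; the helper names `build_validity_windows` and `rate_at`, the timestamps and the rates are hypothetical.

```python
from bisect import bisect_right


def build_validity_windows(rate_updates):
    """Turn (timestamp, rate) updates, in any arrival order, into windows.

    Returns a list of (start, end, rate) sorted by start; `end` is None for
    the most recent rate, which stays valid until a newer one arrives.
    """
    ordered = sorted(rate_updates)
    windows = []
    for i, (start, rate) in enumerate(ordered):
        end = ordered[i + 1][0] if i + 1 < len(ordered) else None
        windows.append((start, end, rate))
    return windows


def rate_at(windows, timestamp):
    """Return the rate whose validity window contains `timestamp`, or None."""
    starts = [w[0] for w in windows]
    i = bisect_right(starts, timestamp) - 1
    return windows[i][2] if i >= 0 else None


# The update at t=20 arrives last and splits the window that began at t=10.
updates = [(10, 120.0), (30, 125.0), (20, 122.5)]
windows = build_validity_windows(updates)
print(windows)  # [(10, 20, 120.0), (20, 30, 122.5), (30, None, 125.0)]

amount_eur = 100
print(amount_eur * rate_at(windows, 25))  # 12250.0
```

Before the late update arrived, the window starting at t=10 would have run until t=30. Shrinking it to end at t=20 is exactly the operation the text says Beam cannot do yet.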

Looping Timers is what is possible with Beam today while Validity Windows is what Beam wants to make possible in the future.

Summary of ways to handle missing data in time series.

Schema-aware PCollections

Apache Beam doesn’t have any knowledge of the actual structure of the records in a PCollection, and little understanding of PTransforms. In practice, most PCollections are schematized: Avro records, BigQuery rows, and even POJOs and case classes. Many operations are performed on structured records: filtering by field, grouping by a specific field, and so on.

Schema-aware PCollections let you define a schema for your PCollection by using elements of a class for which a schema is declared, so that Beam knows the fields and types of each record. This allows you to write to e.g. BigQuery automatically, without first converting your PCollection to TableRow objects. Schema-aware PCollections also make joins and group-by operations easier.
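As a rough analogy in plain Python (not the Beam API), records with named, typed fields make filtering and grouping by field straightforward. The `Purchase` type and its fields are invented for this sketch.

```python
from collections import defaultdict
from typing import NamedTuple


class Purchase(NamedTuple):
    """A record with a fixed set of named, typed fields (a hypothetical schema)."""
    user_id: str
    country: str
    amount: float


purchases = [
    Purchase("u1", "BE", 12.5),
    Purchase("u2", "NL", 40.0),
    Purchase("u1", "BE", 7.5),
]

# Filter by field: no manual parsing of the record is needed.
belgian = [p for p in purchases if p.country == "BE"]

# Group by field and aggregate another field.
total_per_user = defaultdict(float)
for p in purchases:
    total_per_user[p.user_id] += p.amount

print(len(belgian))          # 2
print(dict(total_per_user))  # {'u1': 20.0, 'u2': 40.0}
```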

Example of how you can use Schema-aware PCollections to filter your data.

Beam SQL

Beam SQL, currently only available in Beam Java, allows a Beam user to query bounded and unbounded PCollections with SQL statements. When you run your pipeline, your SQL query is translated to a PTransform. Querying the values inside a PCollection relies on schema-aware PCollections. Apache Calcite provides the basic dialect underlying Beam SQL. Calcite's SQL dialect is widely used in big data processing and includes some streaming enhancements.

During this talk, Beam SQL was demoed on a streaming pipeline. By using tumble_start you can do an underlying GroupByWindow on a PCollection to query the values within this PCollection. A count(*) is translated to a Count PTransform.
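For readers unfamiliar with tumbling windows, the following plain-Python sketch shows what a count per fixed, non-overlapping window computes. It is not Beam SQL and makes no claim about how Beam executes the query; the function name and the integer timestamps are assumptions for illustration.

```python
from collections import Counter


def count_per_tumbling_window(timestamps, window_size):
    """Count events per fixed, non-overlapping window.

    Returns {window_start: count}. The window start is the value a
    tumble_start-style column would report for each group.
    """
    counts = Counter((ts // window_size) * window_size for ts in timestamps)
    return dict(sorted(counts.items()))


event_times = [1, 14, 59, 60, 61, 185]
window_counts = count_per_tumbling_window(event_times, window_size=60)
print(window_counts)  # {0: 3, 60: 2, 180: 1}
```

Note that the window starting at 120 does not appear at all, because no event fell into it. That is the missing-data case discussed in the Time Series section.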

Demo of Beam SQL

Beam SQL makes it really easy to visualize your data and to get the result of simple transformations on your data. It allows users to use the SQL syntax they are familiar with, right from within their Beam pipeline. For examples on how to use Beam SQL, see https://beam.apache.org/documentation/dsls/sql/walkthrough/.

Python 3 Support

And last but not least, one of my colleagues, Robbe Sneyders, gave an update on the status of porting Apache Beam to Python 3. At ML6 we use Apache Beam on Python and helped port it to Python 3, since Python 2 is no longer supported after January 1st, 2020. Beam supports Python 3 as of the 2.11 release. Support for Python 3.5 to 3.7 is mostly complete, except for VCF IO and type hinting. We are also working on adding support for Python 3 specific syntax, such as keyword-only arguments.

The biggest change between Python 2 and 3 is the way bytes and strings are handled. In Python 2 a string can hold either raw bytes or encoded text, and the two are easy to mix up. This leads to a lot of problems, especially when working with non-English text data. In Python 3 strings are Unicode by default and bytes are a separate type.
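A short Python 3 example shows the separation. The sample word is arbitrary.

```python
text = "caf\u00e9"            # str: a sequence of Unicode code points ("café")
data = text.encode("utf-8")   # bytes: the UTF-8 encoded form of the same text

print(len(text))  # 4 characters
print(len(data))  # 5 bytes, because "é" takes two bytes in UTF-8
print(data.decode("utf-8") == text)  # True

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```

Conversions between text and bytes have to be explicit, via `encode` and `decode`, so encoding mistakes show up as errors instead of silently corrupting non-ASCII text.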