In a previous post I showed you how to get started with pyflink, but now that the proper release is out, I thought I would do a follow up post with v0.10.1, and some more complicated examples.

To start out with, as before, pull down and build flink from github.

git clone https://github.com/apache/flink
cd flink
mvn clean install -DskipTests

At this point the bleeding-edge Flink build will be symlinked at build-target in the flink directory. You can start up Flink with the command:

./build-target/bin/start-local.sh

You can see a difference from the older versions right away by checking out the new jobmanager UI at localhost:8081, which is much more fully featured and overall prettier than before. It updates dynamically, without manual refreshing, and also seems to keep the data for completed jobs out of the box. If you've ever debugged a Spark issue through its UI, you'll probably appreciate these two little features.

Anyway, in the first post we started with a really simple word count example. In this post, I am going to present a few more complicated examples, and talk a little bit about how to run them in a more programmatic way.

Runner

Pyflink jobs are required to follow a certain format. They have to be set up in a main block, and can have some arguments passed in as runtime parameters. This differs from how you would set up a SparkContext, for instance, which you can instantiate pretty much anywhere. As an example:

import os
import sys

from flink.plan.Environment import get_environment
from flink.plan.Constants import INT, STRING, WriteMode
from flink.functions.GroupReduceFunction import GroupReduceFunction

if __name__ == "__main__":
    # get the base path out of the runtime params
    args = sys.argv[1:]

    # set up the environment with a source (in this case a text file)
    env = get_environment()
    data = env.read_text('foo')

    # build the job flow
    data \
        .map(lambda x: x, STRING) \
        .write_csv('bar', line_delimiter='\n', field_delimiter=',')

    # execute
    env.execute(local=True)

This code must then be called using the pyflink{2 or 3}.sh script, and the arguments must be passed in there:

pyflink3.sh example.py foo bar baz

So to integrate the pyflink job into a larger Python system, we need to invoke this call from Python. On top of this, the job itself is passed into Flink in such a way that its location on disk is changed, so any file reading or writing that uses os.path to find a relative directory will fail. At a minimum, then, we need to pass in a base path as an argument.

To solve these two issues, I've added a runner.py script to my examples repository. In it, we encapsulate the system calls in Python functions and pass in the base path for the project so that the input and output files can actually be found by Flink.
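The idea behind the runner can be sketched in a few lines. This is a minimal, hypothetical version (the function names here are illustrative, not necessarily what runner.py uses), assuming pyflink3.sh is on your PATH:

```python
import os
import subprocess

def build_command(script_name, base_path, *args, python_version=3):
    """Assemble the pyflink{2,3}.sh invocation for a job script.

    The base path is passed as the first job argument so the script
    can resolve relative input/output files no matter where Flink
    stages it.
    """
    return (['pyflink%d.sh' % python_version,
             os.path.join(base_path, script_name),
             base_path] + list(args))

def run_flink_job(script_name, base_path, *args):
    # blocks until the job finishes; returns the shell exit code
    return subprocess.call(build_command(script_name, base_path, *args))
```

With something like this in place, each example becomes a one-line Python call instead of a shell incantation.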

Examples

As for examples, we have four new ones. After a quick explanation of each, you can look at the resulting Flink plan generated in the UI.

Trending Hashtags

A very similar example to word count, but includes a filter step to only include hashtags, and different source/sinks. The input data in this case is read off of disk, and the output is written as a csv. The file is generated dynamically at run time, so you can play with different volumes of tweets to get an idea of Flink's scalability and performance.

Data Enrichment

In this example, we have row-wise JSON in one file, with an attribute field that refers to a CSV dimension table of colors. So we load both datasets in, convert the JSON data into an ordered and typed tuple, and join the two together to get a nice dataset of cars and their colors.
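The shape of the enrichment is roughly the following. Note that the field names (make, model, color_id) are made up for illustration; the repository's example defines its own schema:

```python
import json

def enrich(json_lines, color_rows):
    """Join row-wise JSON records against a CSV-style dimension table."""
    # dimension table: color id -> color name
    colors = {row[0]: row[1] for row in color_rows}
    enriched = []
    for line in json_lines:
        record = json.loads(line)
        # convert to an ordered, typed tuple, joining on the color id
        enriched.append((record['make'], record['model'],
                         colors[record['color_id']]))
    return enriched
```

In the Flink version the same join is expressed with the join operator and a key selector, so the dimension table never has to fit in one process's memory.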

Mean Values

Takes in a CSV with two columns and finds the mean of each column, using a custom reducer function. Afterwards, it formats a string nicely with the output and dumps that onto disk.
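The reducer pattern here is worth spelling out, since it's the same accumulate-then-finalize shape you'd use in a Flink GroupReduceFunction. A plain-Python sketch, assuming two numeric columns:

```python
from functools import reduce

def mean_reducer(acc, row):
    # accumulate (sum of col 1, sum of col 2, row count)
    s1, s2, n = acc
    return (s1 + row[0], s2 + row[1], n + 1)

def column_means(rows):
    """Reduce all rows to running sums and a count, then divide."""
    s1, s2, n = reduce(mean_reducer, rows, (0.0, 0.0, 0))
    return s1 / n, s2 / n
```

Keeping sums and a count in the accumulator, rather than averaging as you go, is what makes the reduction associative enough to run in parallel.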

Mandelbrot Set

Creates a Mandelbrot set from a set of candidates. Inspired by this post.
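The per-candidate test is embarrassingly parallel, which is what makes it a nice Flink workload. A minimal sketch of the membership check (the iteration cap of 50 is an arbitrary choice here):

```python
def in_mandelbrot(c, max_iter=50):
    """Iterate z = z**2 + c; candidates that stay bounded are kept."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        # points whose magnitude exceeds 2 provably diverge
        if abs(z) > 2:
            return False
    return True
```

In the job, each candidate point becomes an element of a DataSet, and this check becomes a filter over it.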

So there you have it: with a little bit of tooling and the new release, you can use Python to do some pretty cool things on huge datasets, very fast. If you'd like to see the source or contribute, it's all at:

https://github.com/wdm0006/flink-python-examples