Prefect (data workflow)

https://docs.prefect.io/core/tutorial/01-etl-before-prefect.html

Origin and First Impressions

Prefect was started sometime in the past few years by Jeremiah Lowin, a core contributor to Airflow. The sense I get is that it’s supposed to be an Airflow 2.0 that addresses some of Airflow’s shortcomings. Prefect has written extensively about how their tool compares to Airflow. Their business model looks to be a freemium one, and they’ve open sourced most, if not all, of the core parts of their product ecosystem. Getting through the tutorial was relatively straightforward, and I got the sense that I was just scratching the surface of Prefect. All the while, it felt like I was just writing regular old Python with a few decorators and context managers.

Tasks / Workflow

Prefect gives you a decorator for functions (‘@task’) and a Task class you can inherit from to define your units of compute. It seemed dead simple to me to incorporate and get started adopting their tools. State is checkpointed after each task runs, which encourages small tasks, and data can flow between tasks. That boundary between tasks might be a good place for runtime data QA checks with a tool like Great Expectations. There’s also caching support around tasks, which makes it easy to avoid recomputing tasks unnecessarily. Tasks can optionally receive inputs and parameters, and produce outputs. You’re able to specify LOTS of logic in DAG generation / flow via signals, triggers, etc. And they have lots of specific kinds of tasks ready to go, out of the box. Tasks get put into pipelines (called “Flows”), which seemed simple enough to use.
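To make the decorator-plus-caching idea a bit more concrete, here is a minimal plain-Python sketch. It is not Prefect’s API: the ‘task’ decorator, the ‘run_count’ attribute and the ‘run_flow’ function are hypothetical names made up for illustration, and the “checkpoint” is just an in-memory dict keyed by the task’s arguments. It only shows why small, decorated functions make it cheap to skip work that has already been done.

```python
import functools


def task(func):
    """Hypothetical stand-in for a task decorator (NOT Prefect's implementation).

    Results are cached in memory, keyed by the positional arguments, so a
    repeated call with the same inputs does not recompute anything.
    """
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            wrapper.run_count += 1      # count real executions only
            cache[args] = func(*args)   # "checkpoint" the result
        return cache[args]

    wrapper.run_count = 0
    return wrapper


@task
def extract():
    # Tuples are used so the values are hashable and can serve as cache keys.
    return (1, 2, 3)


@task
def transform(rows):
    return tuple(r * 10 for r in rows)


@task
def load(rows):
    return sum(rows)


def run_flow():
    """Hypothetical 'flow': data passes from one small task to the next."""
    return load(transform(extract()))


first = run_flow()
second = run_flow()  # served entirely from the cache
```

Running the flow twice produces the same result, while each task body executes only once. Real Prefect tasks add state tracking, retries, and configurable cache validity on top of this basic idea.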

Unit testing seems reasonable to implement, at least judging from the examples. The docstrings didn’t seem super helpful, but the tool does seem to be type annotation aware and friendly. The metadata around tasks and the workflow seems fairly rich as well. Their CLI seemed reasonable, and it looked like it connected with their cloud services.

Installing the UI was fairly straightforward. It pulled lots of dependencies into local Docker containers, and seemed to start up without error. I had trouble running the tutorial example flow in the UI, though. At first it looked to me like this was a premium-only feature, although I’m seeing that they have open sourced their UI recently, too. Update: I needed to first run a command that wasn’t clear to me from the tutorial (prefect backend server). Once I did that, I was able to see runs I had completed locally. The UI felt to me like a flashier, upgraded version of the Airflow UI.

Two nice features that I was able to get working without difficulty were swapping in a different Executor back-end and using the Scheduler. Prefect uses Dask as an Executor (easily parallelizable) of tasks by default, but this system looks to be pluggable, with a relatively straightforward API (as far as I know, their system as a whole is fairly pluggable). And their Scheduler provides many different configurations out of the box.
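The appeal of a pluggable executor is that the code describing the work does not change when the execution back-end does. The sketch below illustrates that idea using only Python’s standard library. It is not Prefect’s executor API: ‘InlineExecutor’, ‘square’ and ‘run_all’ are hypothetical names, and a thread pool stands in for a parallel back-end such as Dask.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class InlineExecutor:
    """Hypothetical executor that runs each call immediately in the caller's thread."""

    def submit(self, fn, *args):
        future = Future()
        try:
            future.set_result(fn(*args))
        except Exception as exc:  # surface failures through the future
            future.set_exception(exc)
        return future


def square(x):
    return x * x


def run_all(executor, fn, inputs):
    """Submit one call per input to any object that exposes submit()."""
    futures = [executor.submit(fn, x) for x in inputs]
    return [f.result() for f in futures]  # results come back in input order


# Same work, two interchangeable back-ends.
inline_results = run_all(InlineExecutor(), square, [1, 2, 3])

with ThreadPoolExecutor(max_workers=2) as pool:
    pooled_results = run_all(pool, square, [1, 2, 3])
```

Both calls return the same values. Only the object passed as ‘executor’ differs, which is roughly the property that makes it easy to move from local debugging to parallel execution.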

Concluding Thoughts

Prefect seems very promising, packed with features and well thought out, with a simple, beautiful and powerful abstraction layer; it really did feel to me like a rewritten Airflow version 2.0. They want you to be able to specify any kind of workflow you want without difficulty, no matter how complex it gets. The only thing I’m a little confused about is their business model. Update: I heard back from Jeremiah that they recently evolved it to a more traditional open-core model. In his words:

“We offer an open-source workflow system (the UI being the last piece, released three weeks ago), and we also offer a managed version of our orchestration platform (server + UI), which includes some more advanced features interesting to large companies (like global concurrency limits, roles and permissions, etc.). The system is designed so that no matter whether you run it locally or through Prefect Cloud, your code and data are always executed on your own infrastructure — that’s why we work so hard to make the back-end switch a single CLI command. By doing this, we made it easy for companies in regulated industries, especially financial services and healthcare, to adopt Prefect by running it locally to PoC and then immediately transition to our managed product.”