Identifying the needs

Having made clear what the problems are, as step 2 we identified our needs.

1. Monitoring executions: we need a web app where we can see the current and past status of all runs.
2. Centralised log access: we need a web app that allows us to access all logs.
3. A new scheduler: we need a reliable scheduler with a UI that is easy to consult and gives us the possibility to manually run tasks or, even better, single sub-tasks.
4. Easy scaling up: we have to be able to scale up when needed without refactoring/rethinking the workflow itself.
5. Fewer programming languages: we need to consolidate and reduce the number of programming languages in use.
6. Code versioning: we need to version the code behind each ETL in one place.

Point 6 deserves attention: it didn't come straight from the analysis of the current situation; instead, it took time to understand that we actually need to model our WFs more precisely and version them. With that, we can minimise the danger of losing knowledge, history and code, and we never again lose control of ETL governance. But that also means we need to define rules, best practices and guides to better preserve history and knowledge.

Another point we took into account, but decided not to write down explicitly, was:

7. Open source and accepted by the industry.

The reason we didn't list it with the others is that this point applies broadly to a lot of (maybe any) solutions, and isn't specific to a WMS.

Alternatives

Exploring the web and doing research, we came up with these possible WMSs: Luigi, Oozie, Argo, Jenkins and, of course, Airflow.

Why did we end up with Airflow?

To be honest, after starting to study and compare these products, it didn't take long to identify significant limitations with respect to our needs in all of them except Jenkins.

The next image depicts the Airflow architecture, which clarifies most of the reasoning behind the choice.

Airflow comes with these very important features:

It does have a scheduler

It does have a very useful web UI for monitoring, checking logs and rerunning tasks

It does scale up natively with the different available executors (Celery, Kubernetes, your own home-made one, …)

It does define workflows programmatically, in the form of directed acyclic graphs (DAGs) written in Python

What about Luigi? Well, Luigi gives you the possibility to write your DAG in Python, just like Airflow, but (source):

- the UI is minimal, and there is no user interaction with running processes
- it does not have its own triggering
- Luigi does not support distributed execution

With these features missing, 4 out of 6 of our requirements are unmet. So, we can't pick Luigi.

What about Oozie? This source guided our study: similarly to Luigi, Oozie comes with a worse UI, but other aspects are better addressed. In this case the bad news comes from the underlying base language and the community: DAGs have to be written in Java/XML, making complicated workflows more difficult to define, and the community is less active.

So, having to choose between the two a system that will stay reliable for years to come, Airflow is still the better option.

What about Argo? Argo is a really great project, natively designed to be used with Kubernetes. This is at the same time its strength and its weakness: if you already have a Kubernetes-based system, it could be the right choice for you. Otherwise, you should be careful: you can't use it without Kubernetes, so you need to plan the double adoption of Argo and Kubernetes (and, as you may know, Kubernetes is a universe of its own). That's basically why we discarded it: to avoid being locked into that container system.

Finally, what about Jenkins? Well, Jenkins is one of the leaders when you talk about pipelines. A few years ago the possibility to run stages in parallel was added, enabling it to exploit its capability to run on many workers.

Thinking of the huge number of plugins available, the community support and the usable documentation, honestly I've been tempted to use it for what we need to do, and I'm still tempted.

In the end, we chose Airflow over Jenkins because the latter is natively designed for building (CI/CD), while the former uses DAGs and is oriented towards moving and transforming data.