The great people at Astronomer.io reached out asking to do a short interview about Apache Airflow and data engineering. Here are the few questions and along with my answers:

[Question 1] When are the next releases of Apache Airflow gonna drop, and what are the major features you’re excited about?

8.0rc4 (release candidate #4) just got voted positively by the Apache community, but got held up after folks at Airbnb found a few blockers. People are working at ironing these out and the release is right around the corner. We should expect 1.8.0 to come out either this week or next.

This is the first release that was not driven by Airbnb. Bolke de Bruin from ING in the Netherlands did an amazing job getting this release out. It represents a huge amount of work that increases as time since the last release grows. This is the first release since 1.7.1.3 that got out on June 13th 2016.

The CHANGELOG since 1.7.1.3 is gigantic. Here a some of the highlights from my particular perspective:

a multi-threaded scheduler, allowing for much faster scheduling cycles and fault tolerance while importing DAG files. Prior to this, a simple sys.exit() in a DAG file could take down the scheduler

in a DAG file could take down the scheduler replaced the Highcharts charting library by NVD3. Highcharts has a non-Apache compatible license and ripping it out takes us out of a legal grey zone

Unix impersonation and cgroups, allowing to run tasks as a specific Unix user, and specifying cgroups to limit resource usage at the task level. This keeps one task from hogging all resources and threatening an Apache Airflow worker

solid Google Cloud Services (GCS) integration with improved operators and hooks

a better more modular dependency engine, making for more maintainable and extendable code, allows for the “why isn’t my task running” new feature on the UI

fixed all issues around handling “zombie” and “undead” processes

better pool handling where there was over-subscription before in previous versions

new sets of operator & hooks

tons of usability and bug fixes across the board

We’re hoping to get a more steady release train going following this schedule, though there’s no official commitment to do so as of yet.

[Question 2] How has the transition from an internal tool at Airbnb to an Apache project been?

It’s been smooth. Apache scaled community contributions by allowing for many external committers to merge pull requests, accelerating the pace of change in the project. On the other hand it has slowed down the pace at which releases come out, forcing us to manage our own release branch, composed of the previous official release plus a list of “cherries” which represent the list of commits we bake on top of that release. We also now tend to develop on this internal branch, and push the PR out to the community once it’s been stabilized in production on our side.

We’ve enjoyed the help we’ve receive in driving the latest release, and watching the project thriving with limited involvement on our side. I used to personally review and merge every single PR, and that got way out of hand over the past few years. It’s great to see these virtuous cycles and this joyous “pyramid scheme” evolve over time.

[Question 3] How do you see Apache Airflow’s usage progressing? Over the next 5 years, what new applications of Apache Airflow will pop up?

The data infrastructure ecosystem has yet to show any sign of converging into something more manageable. It seems like we’re still in a huge phase of expansion where every new day bring new distributed database, new frameworks, new libraries and new teammates. As these systems get more complicated and evolve rapidly, it becomes even more important to have something like Apache Airflow that brings everything together in a sane place where every little piece of the puzzle can be orchestrated properly with sane APIs.

As Apache Airflow reaches feature completeness in the orchestration world, we can assume that integration with other system (hooks and operators) is an area of growth.

Where the original assumption was that Apache Airflow would act mostly as an orchestrator and not taking on real workload, it appears that many use Apache Airflow workers for more complex workloads like running R scripts, Python data processing tasks, ML model training, ranking, … While we internally encourage people to write services and leverage infrastructure like Kubernetes or Yarn for this type of things, it appears like there’s a need for Apache Airflow to grow in that direction as well, with support for containerization (please run this task inside this Docker container!), and resource management (allocate 4 cpus and 64GB of ram to this job please). We’re aware of the constraints that people may have in their environment and want to allow them to get the most mileage out of Apache Airflow. So if you Kubernetes cluster laying around we should make the most of it, if you don’t you we want you to be able to be able to run your jobs in Apache Airflow in the meantime.

I believe Apache Airflow is positioned to be the batch process orchestrator that will dominate the next 5 years. We have a solid technological foundation and a large, high-momentum community!

[Question 4] How do you view similar technologies in the space, such as Luigi, Azkaban, etc?

I personally haven’t used Luigi, Azkaban or Oozie so I’ll just be echoing words from refugees and castaways from these communities that have since joined the Apache Airflow community.

About Luigi, it is simpler in scope than Apache Airflow, and perhaps we’re more complementary than competition. From what I gather, the main maintainer of the product has left Spotify and apparently they are now using Apache Airflow internally for [at least] some of their use cases. I do not have the full story here and would like to hear more about it. I’m thinking that many of the companies choosing Luigi today might also choose Apache Airflow later as they develop the need for the extra set of features that Apache Airflow offers.

About Azkaban, it’s unclear to me who uses it outside of LinkedIn. It appears that the project doesn’t really have an active community at this point and I doubt that the project will start growing in that direction. Outside of LinkedIn, I’ve heard of a few anecdotal instances of companies running it, where someone close to the project at LinkedIn would have left the company and started at some other place wanted to stick with what they know.

Oozie is the piece of software I’ve heard most negative feedback on, ever. Try to find one Oozie user out there that is not in the core circle that has overall positive feedback on it. Just try! It may be one of these things where the project has solved some of the core issues people still complain about, but I think the name is tarnished and cannot be salvaged.

I firmly believe in configuration as code as a way to author workflows, and I see that Apache Airflow’s relevance in the modern data ecosystem is growing steadily. It certainly appears as though every startup that is serious about data & analytics in the Bay Area is running Apache Airflow at this point.

http://www.timqian.com/star-history/#apache/incubator-airflow&spotify/luigi&apache/oozie&azkaban/azkaban

[Question 5] How will data engineering as a discipline change startups in the next 5–10 years?

Modern startups don’t treat analytics and data as an afterthought anymore. They typically onboard their first data scientist early on, and the first wave of engineers will instrument some vital analytics in early versions of their product. VCs demand accountability and may provide the services of a “growth hacker” early on to provide guidance to startup and to measure their potential return on investment and see where to double down.

I think future startups will be catapulted up the data maturity curve with access to better, cheaper, more accessible analytics software and services. A lot of the work is already getting commoditized through open source packages, but there’s also a growing array of integrated vendor solutions like MixPanel, Interana, Optimizely, and a growing offering on cloud providers like AWS, GCS and Microsoft.

Things that used to be bleeding edge like real-time Olap analysis, anomaly detection, A/B testing at scale and user segmentation an cohort analysis are now at reach of any startup with a minimum of talent and proper founding.

While these offerings become more available, they become a necessity to stay competitive, and provide opportunities for agile startups to hop over established slow players in the field.