Introducing Dagster

An open-source Python library for building data applications

Today the team at Elementl is proud to announce an early release of Dagster, an open-source library for building systems like ETL processes and ML pipelines. We believe they are, in reality, a single class of software system. We call them data applications.

What is Dagster?

Dagster is a library for building these data applications.

We define a data application as a graph of functional computations that produce and consume data assets. In a Dagster-built data application, business logic can be in any tool; the graph is queryable and operable via an API; and actual execution is on arbitrary compute targets.

Builders can use the tool of their choice — e.g. Spark for data engineers, SQL for analysts, Python for data scientists — all while collaborating on the same logical data application. They do not have to abandon all their existing code or investments in those tools.

By adopting this library, builders and operators gain access to new tools built on an API. These tools are meant for visualization, configuration, local development, testing, monitoring, and so forth. Because these tools are built on an API, there is the opportunity to build an entire tooling ecosystem around this library, not just a set of first-party tools.

Dagster’s computational graphs are (a) abstract and (b) queryable and operable over an API, and therefore can be deployed to arbitrary compute targets. Example targets include Airflow, Dask, Kubernetes-based workflow engines, and FaaS (functions-as-a-service) platforms. This means that regardless of the physical compute infrastructure, builders and operators can both benefit from the shared programming model and tools.
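To make the idea of an abstract graph concrete, here is a minimal sketch in plain Python. It is not Dagster's API: the GRAPH format, the hand-written ORDER list, and both runner functions are hypothetical names chosen for illustration, and a thread pool merely stands in for a remote compute target.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, tool-agnostic description of a graph: each step names its
# function and the steps it depends on. Nothing here says where it will run.
GRAPH = {
    "numbers": (lambda: [1, 2, 3], []),
    "doubled": (lambda xs: [2 * x for x in xs], ["numbers"]),
    "total": (sum, ["doubled"]),
}
ORDER = ["numbers", "doubled", "total"]  # a valid dependency order, written by hand


def run_in_process(graph, order):
    """Target 1: call each step directly in the current process."""
    results = {}
    for name in order:
        fn, deps = graph[name]
        results[name] = fn(*(results[dep] for dep in deps))
    return results


def run_on_pool(graph, order):
    """Target 2: submit each step to a worker pool (a stand-in for a remote engine)."""
    results = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        for name in order:
            fn, deps = graph[name]
            future = pool.submit(fn, *(results[dep] for dep in deps))
            results[name] = future.result()
    return results
```

Because the graph description carries no execution details, both runners accept it unchanged and produce the same results. Only the runner differs.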

We believe that adopting Dagster will immediately improve productivity, testability, reliability, and collaboration in data applications. If broadly successful, it will lead to an entirely new open ecosystem of reusable data components and shared tooling.

The rest of this article will dive into the genesis of and inspiration for this project, the unique challenges of building data applications, more details on Dagster itself, and our road ahead.

Where did this come from?

I’ve been working on Dagster for over a year, but the bulk of my career was spent at Facebook on our product infrastructure team. Product infrastructure’s mission was to make our product developers more successful and productive. I worked up and down our technology stack, and ended up creating GraphQL, now a successful open source technology used by hundreds of thousands of developers.

But Dagster is not a technology for product developers. It is for data scientists, data engineers, analysts, and the infrastructure engineers that support them.

The move from focusing on product infrastructure to focusing on data infrastructure was an interesting transition. I left Facebook in 2017 and began to explore what to do next. As I was talking to leaders and practitioners inside Silicon Valley and in more traditional firms, the same refrain kept on coming up over and over again:

“Our data is totally broken”

My immediate reaction was confusion: How does one break data? I quickly came to realize that it wasn’t a technical or engineering problem statement. Instead, it was an instinctive recognition that something is wrong at a systemic level. Data integration, analytics, and machine learning are simultaneously some of the most important and least reliable systems in the modern enterprise.

It is difficult for leadership to get engineers to work on data management problems because they aren’t considered glamorous. Further compounding the problem, engineers and non-engineers who do engage report that they feel as if they waste most of their time.

How do they express this? If you’ve been to any conference with data engineers or data scientists, you’ve probably heard someone say something like:

Practitioners who aren’t Borat say something more like:

“I spend 80% of my time cleaning the data, and 20% of my time doing my job.”

Taking this statement literally, it would be logical to focus exclusively on making data cleaning faster. However, this would be the classic mistake of blindly accepting what people say instead of figuring out what they mean.

They say they waste their time data cleaning, but what they mean is a whole host of other activities: Rolling their own custom infrastructure, maintaining unreliable processes built atop untested software, and the instinctive — and accurate — sense that they are doing repetitive work that should not be necessary. This is not about the speed of data cleaning, but problems at a deeper, structural level.

Where has this happened before?

Travel back in time to 2009 and talk to a frontend web engineer, and you would likely hear them say something like: “I spend 20% of my time building my app, and 80% of my time fighting the browser.”

Sound familiar? Just as data practitioners do today, frontend practitioners said one thing but meant another: They said they were fighting the browser, but what they meant is that they were using the wrong software abstractions.

If you were to take that same frontend engineer in 2009 and show them the developer experience today, their minds would be blown.

“React has conquered the Web” — Laurie Voss, JSConf 2019

While the browsers did get better, it was ultimately the software abstractions and the ecosystem around them that proved decisive. In particular, React. Released in 2013, React was critical to this transformation, and it now dominates frontend development.

React defined its domain well, and then was able to solve entire classes of problems within that domain:

A [React] program is one that predictably manipulates a complex host tree in response to external events like interactions, network responses, timers, and so on. — Dan Abramov

Describing React in full is well beyond the scope of this article. For that we recommend:

React provided a novel, well-designed, higher-level component model over the native browser APIs. React took more formal engineering principles — functional programming in particular — and adapted them in a way that was intuitive to practitioners. It did all of this while being incrementally adoptable in existing systems.

React respected and acknowledged this new, emerging engineering discipline, and recognized the true, essential complexity of this class of software. These engineers were no longer just cobbling together scripts to animate a website; they were building fully-fledged frontend applications.

What does this have to do with data?

At Elementl, we believe that data processing both needs, and is on the cusp of, a transformation similar to the one that frontend development needed nearly a decade ago.

Historically, these data processing systems have been organized as a set of jobs or scripts, loosely stitched together with a workflow engine. Or, they were assembled in a highly constrained, graphical tool or development environment meant to “abstract” away the engineer.

In modern systems, we believe they are more appropriately thought of as data applications.

They are complex pieces of software that are difficult to author, test, and operate. They are built collaboratively by a wide variety of personas using a vast array of heterogeneous tools. They are mission-critical to businesses, and their downtime can result in massive costs, inconvenience, and loss of efficiency. And the current software meant to structure these systems is woefully inadequate to the task at hand.

What is a Data Application?

Data Application (noun): Graphs of functional computations that consume and produce data assets.

There is a lot of terminology in this domain: ETL, ELT, ML pipelines, data integration systems, data ingestion, data warehouse builds, and so on and so forth. We believe that most of this terminology is outdated or duplicative: All of these terms in reality encompass a single, well-defined category of software systems.

Take ETL (Extract-Transform-Load). Historically this term was used to describe the process and tools used to transform — in a single batch process within a single tool — the data in a relational database with well-defined schema to a similarly schematized data warehouse structured for efficient analytical queries.

Traditional ETL tool: graphical, inflexible, proprietary, and vertically-integrated

Today’s so-called “ETL” in no way resembles that. Practically speaking, it is shorthand for any sort of data processing. It typically has many stages of processing and materialization, in many different languages, runtimes, and tools, dealing with the full range of un-, semi-, and fully-structured data.

Given this redefinition, a modern-day “ETL” process has virtually the same structure as a typical SaaS (software-as-a-service) integration or an ML pipeline: they are all graphs of functional computations that produce and consume data assets. Only the final output differs: in an ML pipeline the final step produces a model, whereas the final step of an ETL or a data integration process produces a dataset.
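The shared structure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Dagster's API: the graph format, run_graph, and every step function below are hypothetical, and the "model" is just a mean. The sketch relies only on the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_graph(graph):
    """Run each node after its upstream nodes; return every output by node name.

    `graph` maps a node name to (function, list of upstream node names).
    """
    sorter = TopologicalSorter({name: deps for name, (_, deps) in graph.items()})
    results = {}
    for name in sorter.static_order():
        fn, deps = graph[name]
        results[name] = fn(*(results[dep] for dep in deps))
    return results


def extract():
    return [{"user": "a", "spend": 10.0}, {"user": "b", "spend": 30.0}]


def clean(rows):
    return [row for row in rows if row["spend"] > 0]


def build_dataset(rows):
    # Final step of an "ETL"-style graph: a dataset.
    return {row["user"]: row["spend"] for row in rows}


def fit_model(rows):
    # Final step of an "ML"-style graph: a trivial model, here just a mean.
    return {"mean_spend": sum(row["spend"] for row in rows) / len(rows)}


shared_steps = {"extract": (extract, []), "clean": (clean, ["extract"])}
etl_graph = {**shared_steps, "dataset": (build_dataset, ["clean"])}
ml_graph = {**shared_steps, "model": (fit_model, ["clean"])}
```

Both graphs reuse the same upstream steps and the same runner. They differ only in the last node, which is the point made above.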

This class of software is increasing in prevalence, importance, and complexity. Data, analytics, and machine learning are only becoming more widespread, valuable, and demanding over time. SaaS integrations are only increasing in number and complexity (a typical modern business uses dozens if not hundreds of SaaS services). Finally, data applications are also ideal candidates for execution on emerging cloud technologies such as functions-as-a-service (FaaS) and infinitely elastic, interruptible runtimes. Both the dynamics of cloud computing and the increasing importance of data processing will compel more software to be written in this form over time.

What makes it hard?

Creating reliable data applications is a software discipline with its own unique challenges. Many of the best practices of generalized software engineering are not directly transferable to the data domain, and specialized approaches are needed. We’ll discuss three properties that are unique to the domain: (1) their uncontrollable inputs, (2) their multi-persona, multi-tool nature, and (3) how difficult they are to develop and test.

Uncontrolled inputs

Data applications differ from traditional applications in that data application authors typically have far less control over their inputs. In a traditional application, if the user inputs invalid data, the application can refuse to do the requested computation, present an error to the user, and have her re-enter the data.

This is not possible in data applications. Data applications are ingesting data from systems or processes that they do not directly control. If unexpected inputs break the computation, you can either update the upstream input — which is rarely possible — or update the computation — which is what almost always happens. Software abstractions and techniques in data must account for this unfortunate reality.
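One way to account for this in code is to check inputs explicitly before the computation runs, so that bad upstream data produces a descriptive failure rather than a crash deep inside the business logic. The following is a minimal sketch in plain Python. The helper names (expect_no_nulls, run_with_checks) and the row format are assumptions made for illustration and are not part of any library.

```python
def expect_no_nulls(rows, column):
    """Return (passed, message) describing whether `column` is always populated."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    if bad:
        return False, f"{column!r} is missing in rows {bad}"
    return True, f"{column!r} is populated in all {len(rows)} rows"


def total_spend(rows):
    """The actual computation; it assumes every row has a numeric 'spend'."""
    return sum(row["spend"] for row in rows)


def run_with_checks(rows):
    """Validate the input first, then run the computation."""
    passed, message = expect_no_nulls(rows, "spend")
    if not passed:
        # The upstream system cannot be asked to re-enter its data, so the
        # failure is reported with context before the computation starts.
        raise ValueError(f"input check failed: {message}")
    return total_spend(rows)
```

Keeping the check separate from the computation means the same check can be reused across steps, and its message can be recorded even when it passes.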

We believe that data quality tests — known as expectations within Dagster — are the critical tool for managing the complexity of data within these systems. An inspiration in this area is Abe Gong and the team working on Great Expectations. See this article for an excellent discussion of this issue:

Multi-persona, multi-tool

The world of data is very heterogeneous, and that will not change anytime soon. Data applications are built and supported by diverse teams: business users, analysts, data scientists, data engineers, machine learning engineers, and traditional engineers. Each of those user types has domain-specific tools and languages that they are accustomed to and productive in. Data engineers might use Scala/Spark, data scientists might write Python within Jupyter notebooks, and analysts likely use SQL — all logically within the same data application.
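As a small illustration of several tools living inside one logical application, the sketch below chains a SQL step and a Python step over the same data. It uses only the standard library, with an in-memory SQLite database standing in for a real warehouse. The table, the step functions, and the sample rows are all hypothetical.

```python
import sqlite3
from statistics import mean


def load_orders(conn):
    """Engineering step in Python: land raw records in a table."""
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("a", 10.0), ("a", 20.0), ("b", 5.0)],
    )


def revenue_by_customer(conn):
    """Analyst step written in SQL."""
    query = (
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


def average_revenue(rows):
    """Data science step in Python, consuming the SQL step's output."""
    return mean(total for _, total in rows)


conn = sqlite3.connect(":memory:")
load_orders(conn)
per_customer = revenue_by_customer(conn)
overall = average_revenue(per_customer)
```

Each step is written in the language its author is most productive in, yet the steps form one dependency chain: the SQL result is the Python step's input.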