The comparative roles of data engineers and data scientists

During a recent Data Syndrome project, I transitioned from the role of data-scientist-slash-author-who-does-his-own-data-engineering to a formal data engineering role, leading a team of data engineers building infrastructure for a team of data scientists building a data product in the energy space. We led a crash course program to provide infrastructure, working only two weeks ahead of the data science team. During this project we learned the importance of instituting a formal QA process in between data engineering and data science. This post shares our lessons learned.

Infrastructure and Data Science

Infrastructure was once physical. When delivering a data product, you delivered two things: the infrastructure and the application which ran on it. Now that infrastructure is software, building a data product means building two concurrent software projects. The application depends on the infrastructure. When these projects are built concurrently, the infrastructure tends to limit the application. As a result, the bugs of the data engineer cast long shadows over the work of the data scientist.

Building good software is an inherently iterative process, just as good writing is mostly rewriting. The act of writing good software is in large part fixing bugs. In this regard, data engineering is no different than software engineering. Most time is spent not implementing systems, but squashing bugs in those systems. That data engineering often involves composing and configuring tools at a “higher level” than pure software implementation does not change this fact.

That debugging is the primary component of writing software is not new. In his memoirs, computer science pioneer Maurice Wilkes wrote:

It was on one of my journeys between the EDSAC room and the punching equipment that “hesitating at the angles of stairs” the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs. — Maurice Wilkes, Memoirs of a Computer Pioneer

Building Good Tools

Software exists in a buggy state until it is exercised. Unit tests are one way of exercising software, but no test replaces actual use by the end user. In data engineering, as a result, many bugs emerge in tools when they change hands — that is, when they move from the data engineer to the end user, the data scientist. This is no different from bugs emerging when users first exercise a website, but because data engineers and data scientists are more alike than developers and end users, the handoff is less visible, and the data scientist can be surprised by the number of bugs that surface when it occurs.

Many teams do not have adequate testing in place to ensure a smooth transition, because the practice of data engineering is less developed than that of pure software engineering. It helps when the two roles overlap: data engineers who analyze data, and data scientists who can build systems.

Data Engineering and Data Science

The result is that a high proportion of blocker bugs emerge for the data scientist as he consumes and works on top of the platforms built by the data engineer. Any small problem in the tools supplied by the data engineer blocks the workflow of the data scientist, who simply wants to be productive. This creates problems for the data engineer, because the proportion of your bugs that are blockers for someone else determines your quality of life. In web development, for instance, most bugs are not blockers: they are obscure enough that they do not significantly hamper website operation, so they are filed and addressed piecemeal, often by a different team. In data engineering, if an expensive data scientist can’t work, it is an emergency. In data product teams, data engineers act in a supporting role for data scientists, who experience any problem in the platform as a bug blocking their workflow.

Data engineering is hard. Internal customers are inherently less rewarding than external ones, because external customers are the focus of the business; internally facing services are seen as costs, not sources of revenue. In iterative data engineering it is easy to create a cycle where the data scientist is perpetually dissatisfied with the systems supplied by data engineering, because they have been created in an ad hoc manner without the infrastructure, tools, support and culture of primary application development. Context and process are critical to maintaining sanity in a data engineering team.

Test Your Infrastructure

Without a solid process for testing infrastructure between data engineering and the data science that follows it, data scientists will experience significant delays as they encounter bugs while implementing features on top of untested infrastructure.

Allocate a shake-out period for testing infrastructure before data scientists use it

This might take one or more weeks, so figure it into your schedule. You will experience this delay whether you schedule it or not, and a formal process will be more efficient than an ad hoc, informal one. In this way you can minimize the total time spent addressing bugs in infrastructure.
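A shake-out period works best when its checks are written down and runnable rather than performed by hand. As a rough sketch — the field names and checks below are hypothetical, loosely modeled on the kind of energy data mentioned above, not our actual system — an automated smoke test for newly delivered infrastructure might look like:

```python
# Hypothetical shake-out smoke test for a freshly delivered dataset.
# In a real setup, `records` would come from the new pipeline or table;
# here a small in-memory sample stands in for it.

def check_not_empty(records):
    """The pipeline actually delivered data."""
    return len(records) > 0

def check_schema(records, required_fields):
    """Every record carries the fields downstream jobs expect."""
    return all(required_fields <= set(r) for r in records)

def check_no_null_keys(records, key):
    """Join keys are populated, or downstream joins silently drop rows."""
    return all(r.get(key) is not None for r in records)

def run_shakeout(records):
    """Run all checks and report pass/fail per check."""
    required = {"meter_id", "timestamp", "kwh"}  # hypothetical schema
    return {
        "not_empty": check_not_empty(records),
        "schema": check_schema(records, required),
        "join_keys": check_no_null_keys(records, "meter_id"),
    }

sample_records = [
    {"meter_id": "m-001", "timestamp": "2017-03-01T00:00:00", "kwh": 1.2},
    {"meter_id": "m-002", "timestamp": "2017-03-01T01:00:00", "kwh": 0.9},
]

print(run_shakeout(sample_records))
```

Checks like these are cheap to write during the shake-out week and catch exactly the class of small problems that would otherwise become blocker bugs on a data scientist’s desk.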

Going forward, we’ll be involving a QA team capable of working with infrastructure and systems to test our infrastructure before we deliver it to data scientists.

Crash Course Capability

Crash course data engineering is not something every team should pursue. If you have the time to run separate programs, you should do so. If time simply doesn’t permit that, it is possible to build data engineering as you go, but your program won’t run smoothly without a formal shake-out period between data engineering and data science, in which a formal testing process exercises the systems you’ve built and works out the bugs.

Addendum

Thanks to Josh Wills and Jesse Anderson for reading drafts of this post, providing invaluable feedback and several of the best sentences in the post :)

Shameless plug: need crash course data engineering? The Data Syndrome team of data scientists and engineers is available to build out your data platform to support the work of data scientists building data products and systems. We also build complete data products as a service. Finally, we offer training in Agile Data Science for all members of data science teams.