Recently, while out with a group of coworkers, I was being introduced to a new colleague and the conversation went something like this: “And this is Hussein, he’s a data scientist.” I am, however, a data engineer and not a data scientist. That same evening, the situation repeated itself with a different crowd: “This is Hussein, he’s in data… wait what do you actually do?” This wasn’t the first time I had this conversation, and it’s safe to say it won’t be the last. This confusion about data roles and the data ecosystem as a whole led me to write this article, to hopefully provide a more detailed explanation of what a data engineer actually does.

Introduction

Data-driven decision making is becoming an increasingly popular philosophy for organizations to adopt. Companies have access to exponentially increasing amounts of data and expect to gain insight and value from it all. We’ve all heard by now that data scientist is the sexiest job of the 21st century. However, it’s a little-known fact that the data ecosystem is supported by a diverse array of professionals, without whom data scientists would not have the resources to do their jobs.

Data analysts and business intelligence experts existed long before the age of data scientists, and will continue to exist for as long as we have analytics data. Similarly, ETL developers play a key role in the analytics ecosystem, and are often specialized in specific warehousing technologies and query languages. The focus of this article, however, is on another relatively young role at the center of the big data movement: the data engineer.

While much focus and fanfare are given to Machine Learning (ML) and Artificial Intelligence (AI), data scientists and ML/AI experts first need good data to work with. To borrow a popular expression: “Garbage in, garbage out”. Essentially, if the data used in machine learning models is of poor quality, then so too will be the output. Data engineers are the ones tasked with building pipelines that produce good input data.
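To make “garbage in, garbage out” a little more concrete, here is a minimal sketch of the kind of quality gate a pipeline might apply before data reaches a model. The `Order` record, its fields, and the list of accepted currencies are all hypothetical, invented purely for illustration; real checks depend on the data at hand.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Order:
    # Hypothetical record shape, made up for this example.
    order_id: str
    amount: Optional[float]
    currency: str


# Assumed whitelist of currencies; a real pipeline would define its own.
ACCEPTED_CURRENCIES = {"CAD", "USD", "EUR"}


def is_valid(order: Order) -> bool:
    """Return True only if the record passes some basic quality checks."""
    return (
        bool(order.order_id.strip())
        and order.amount is not None
        and order.amount >= 0
        and order.currency in ACCEPTED_CURRENCIES
    )


def split_by_quality(orders: Iterable[Order]) -> Tuple[List[Order], List[Order]]:
    """Separate clean records from rejected ones so bad rows never reach a model."""
    clean: List[Order] = []
    rejected: List[Order] = []
    for order in orders:
        (clean if is_valid(order) else rejected).append(order)
    return clean, rejected


raw = [
    Order("A-1", 120.0, "CAD"),
    Order("", 35.5, "USD"),      # missing identifier
    Order("A-3", None, "EUR"),   # missing amount
    Order("A-4", -10.0, "USD"),  # negative amount
    Order("A-5", 80.0, "USD"),
]
clean, rejected = split_by_quality(raw)
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected rows, rather than silently dropping them, is one possible design choice: it leaves a trail that can be inspected when a source system starts sending bad data.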

The Anatomy of Data Engineering

O’Reilly, an online learning platform, suggests two main areas of focus for data engineers:

Building and maintaining an organization’s data pipeline systems

Cleaning and wrangling data into a useable state

In essence, these two areas of focus combine three existing specializations: software engineering, by virtue of having to build data pipelines and systems; DevOps engineering, by virtue of having to maintain data infrastructure; and data analytics, by virtue of having to clean and wrangle data. This can serve as an alternative definition of the data engineer: someone with the expertise of a software engineer, a DevOps engineer, and a data analyst.

At SSENSE, the data engineering team is tasked with building and maintaining a data lake. Our primary stakeholders are the data reporting team and the data science team, both of whom work closely with us and rely on us for simple and convenient access to refined data.

With that said, what are the skills required to be a data engineer? Previously, we spoke about the skills needed to be a developer at SSENSE. Most, if not all, of these skills apply to data engineers as well. One of the most common patterns we see in applicants is a focus of expertise on cleaning and wrangling data, but a lack of acumen in concepts that are considered fundamental in software engineering, such as version control, continuous integration, and code reviews.

Occasionally, we also run into the opposite problem of candidates being skilled software engineers, but lacking SQL and data wrangling proficiency. While being a proficient developer is required, so is a deep understanding of data wrangling techniques, and a thorough understanding of SQL, given its importance and widespread use across the industry.
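As a rough illustration of what SQL-based data wrangling can look like, the sketch below uses Python’s built-in `sqlite3` module to normalize and deduplicate a small table. The `raw_events` table, its columns, and the sample rows are hypothetical and exist only for this example; it is not a description of any pipeline at SSENSE.

```python
import sqlite3

# In-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, email TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("u1", " Alice@Example.com ", "2020-01-01"),  # stray spaces and casing
        ("u1", "alice@example.com", "2020-02-01"),    # newer row for the same user
        ("u2", "BOB@example.com", "2020-01-15"),
        ("u3", None, "2020-01-20"),                   # missing email
    ],
)

# Normalize emails, drop rows without one, and keep only the latest row per user.
query = """
SELECT r.user_id, LOWER(TRIM(r.email)) AS email, r.updated_at
FROM raw_events AS r
JOIN (
    SELECT user_id, MAX(updated_at) AS latest
    FROM raw_events
    WHERE email IS NOT NULL
    GROUP BY user_id
) AS m
  ON m.user_id = r.user_id AND m.latest = r.updated_at
WHERE r.email IS NOT NULL
ORDER BY r.user_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The same cleanup could be written in application code, but expressing it in SQL keeps the logic close to the data, which is part of why SQL proficiency matters so much for the role.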

Moreover, to quote an article by James Furbush on O’Reilly, good data engineers need to have “a deep understanding of the ecosystem, including ingestion (e.g. Kafka, Kinesis), processing frameworks (e.g. Spark, Flink) and storage engines (e.g. S3, HDFS, HBase, Kudu). They should know the strengths and weaknesses of each tool and what it’s best used for”. Beyond the technical aspects, data engineers must also be adaptable and keen learners, because the tools and technologies evolve at a rapid pace and there are, as yet, few industry standards.

The work of a data engineering team can be integral to defining and establishing a data culture within an organization. As such, data engineers should also be expert communicators, in order to establish effective relationships with stakeholders and evangelize data-driven initiatives across the organization.

In an organization with a more mature data practice, the role of the data engineer may become more specialized, focusing on building and maintaining specific pipelines while working with a specific team, or on maintaining larger infrastructure or databases. For example, an organization with complex recommendation engines (think Netflix or Spotify) needs stable pipelines to deliver input data to the teams responsible for applying models and algorithms to produce recommendations. In a similar vein, an organization that is deeply reliant on dashboards and visualizations produced by a data warehouse needs to be certain that this warehouse is optimized and able to meet the demands of all its users.

Thinking Long Term

With the role still evolving constantly, what does the future hold for data engineers? One thing is certain: as our rate of data production accelerates, the need to efficiently manage this data will accelerate as well. The volume, variety, and velocity of data will continue to increase, and the tools to handle them will evolve. While there will be an unquestionable need for data engineers in the foreseeable future, they will have the challenging task of keeping up with this breakneck pace of change.

For data engineers with an aptitude for machine learning, a new specialization is emerging which also promises an exciting long-term career: the role of a machine learning engineer. Machine learning engineers can be thought of as data engineers with deep domain expertise in production-grade machine learning. They can be responsible for developing and maintaining the full lifecycle of a model, from training to production. In the age of real-time recommendations, predictions, audio-visual recognition, and other feats of artificial intelligence, the demand for production-grade machine learning seems never-ending.

In short, there’s no shortage of work up ahead for data practitioners who are also proficient software engineers. As a result, the nascent and often misunderstood field of data engineering seems to have a promising future.