Want to be a data engineer? Here’s what you need to know

Gianluca Ciccarelli, Data Engineer at Bolt

A little more than a year ago, I didn’t know what data engineering was. Today, I am working with the leading data science team in Europe, at one of the fastest-growing ride-hailing startups in the world. I help collect, filter, and transform raw data to make it useful for some of the core business functions at Bolt (formerly known as Taxify). In this post, I share what I have learned about the goals and challenges that data engineers face daily.

Before joining Bolt, I had always been a software engineer. I had a general understanding of what Data Science and Analytics were. What I didn’t know was how important it is for data to meet a few requirements in order to be used effectively, and how engineers can help make it readily accessible.

Data is an asset

Bolt is a relatively small company, but we deal with a significant amount of data. This data enables us to build predictive models and use them to improve our services. For riders, it means we can more accurately estimate the time it takes to get a car. It also means, for instance, that we can provide a fair price according to supply and demand, which benefits both drivers and riders. We feed the data to our Business Intelligence tools, which in turn help us evaluate the outcomes of marketing and UX experiments and make informed decisions about the direction of our expansion. We can provide enough context for our Fraud Prevention team to decide confidently what constitutes fraud and what doesn’t. Developers can tell whether the traffic coming to their API is flowing in unexpected ways, or whether correct operation requires more hardware.

The possibilities offered by harnessing the data are limited only by our skills and creativity. We are constantly on the lookout for new methodologies, platforms, and tools to help us leverage and make sense of this data. We need all these tools to support a process sometimes referred to as Extract-Transform-Load, or ETL for short. But ETL is only one part of the way we pre-process data: we also need to deal with stream processing and general data creation via simulators. The team that holds this responsibility is called Data Engineering (DE). Our full job description is broader than that, but before giving one, I’ll introduce the tools of the trade.

Data Engineering: Goals & Challenges

Data comes from a variety of sources, internal and external, and in different formats. For our users to be productive, data needs to be accessible uniformly. While they should understand the nature of the data they deal with, our users should not be held responsible for its quality. The collected data might end up containing duplicates, being outdated, or being corrupted in any number of ways; one of the jobs of a data engineer is to prevent these issues, or at least minimize their occurrence.

Data Engineering is a relatively young field. As in any specialization, if you want to do it well, you need to learn constantly and do so efficiently enough to be able to make valuable contributions. But as in any young field, best practices, guidelines, and general documentation are scant and scattered. A certain comfort with uncertainty and fast change helps.

The data that we deal with in our company is collected in a heterogeneous and distributed storage which the industry commonly refers to as a data lake. One of its main components is a relational DB (RDB), which offers us a number of well-known guarantees. It turns out, though, that it is not well suited to big data analysis: the kind of query needed for building statistical models is not the kind your generic RDB is optimized for, to the frustration of the model tinkerers. The second fundamental building block of the data lake is the Amazon Simple Storage Service (S3). It is great for storing enormous amounts of data, but it is not meant for data analysis. Enter Amazon Redshift. Redshift is a column-oriented database management system, living in the cloud, optimized for aggregated data analysis. It performs well if used wisely, but it can become sluggish if you don’t maintain it. And it comes at a cost that depends roughly on the amount of storage you use and on the memory and CPU power you need for your queries. This means that you want your SQL to be efficient, and you want to store only the data you need for analysis.

What I describe in the following is part of our current architecture, but I want you to understand that we consider it only a temporary, disposable stepping stone in our pursuit of a state-of-the-art Data Science architecture.

In our workflow, Redshift introduces a number of advantages but also some problems. The RDB is only one of the sources that feed it. Because data needs to be copied from the RDB and then transformed, its availability is subject to a lag: the physical time needed to transport the data over to Redshift, plus the processing time. And when things go south, that lag can grow to days. Not all data is useful for analysis, either: we select which columns we want in Redshift, so the tables at the destination are generally narrower than the ones at the origin. Sometimes we make mistakes, and we find ourselves, say, with nullable columns that only contain null values. And as we accumulate data over time, we might end up with duplicates.
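A mistake like the all-null column one can be caught with a simple scan of a table sample. Here is a minimal, illustrative sketch (the table contents are made up) of the kind of check involved:

```python
def all_null_columns(rows):
    """Return the set of column names whose value is None in every row.

    `rows` is a list of dicts mapping column name -> value, as you might
    fetch from a sample of a narrow destination table.
    """
    if not rows:
        return set()
    candidates = set(rows[0])
    for row in rows:
        # Keep only the columns that are still null in this row too.
        candidates = {c for c in candidates if row.get(c) is None}
        if not candidates:
            break
    return candidates

# Hypothetical sample: 'promo_code' was exported but never populated.
sample = [
    {"id": 1, "city": "Tallinn", "promo_code": None},
    {"id": 2, "city": "Riga", "promo_code": None},
]
print(sorted(all_null_columns(sample)))  # ['promo_code']
```

In practice you would run the equivalent aggregation in SQL rather than pull rows out, but the logic is the same: a column that is null everywhere is a candidate for dropping from the export.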

The RDB exporter in the diagram below is the software that we develop and maintain to take the data from the RDB and push it to S3. Our services also write data to S3, and some of it is of interest to the Data Science team. The S3 importer is the software service that takes the data from S3 and copies it into Redshift, a process known as LOAD (the “L” in ETL).
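To make the LOAD step concrete: Redshift ingests files directly from S3 via its COPY command. The sketch below only assembles such a statement; the table name, bucket path, and IAM role are hypothetical placeholders, not our actual setup.

```python
def build_copy_statement(table, s3_prefix, iam_role, columns=None):
    """Assemble a Redshift COPY statement that loads files from S3.

    COPY ... FROM 's3://...' IAM_ROLE '...' is standard Redshift syntax;
    an optional column list restricts the load to the columns we kept.
    """
    column_list = f" ({', '.join(columns)})" if columns else ""
    return (
        f"COPY {table}{column_list}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )

print(build_copy_statement(
    "rides",
    "s3://example-data-lake/exports/rides/",
    "arn:aws:iam::123456789012:role/redshift-loader",
    columns=["id", "created_at", "city_id"],
))
```

An importer service like ours is, at its core, orchestration around statements like this one: deciding which prefixes are ready, issuing the COPY, and verifying the row counts afterwards.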

A major source of problems is the impedance mismatch between the RDB and Redshift. A primary key constraint in the RDB can’t be relied on in Redshift, not because it’s missing, but because it is misleading: as the docs state, the constraint is informational only and is not enforced (even though the query optimizer uses the information). There are also concerns that need to be addressed in Redshift but don’t exist in relational DBs: for example, the sort key and the distribution key, which relate to the way data is physically stored in the high-availability architecture at the heart of Redshift. Moreover, as said above, our data comes from different sources, but it needs to be stored uniformly.
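To illustrate what those Redshift-specific concerns look like in a table definition, here is a hypothetical DDL (the table and columns are invented for the example). DISTKEY and SORTKEY are the table attributes that have no analogue in the source relational DB:

```python
# Redshift-flavoured CREATE TABLE, held as a string for illustration.
# DISTKEY controls which node each row is hashed to; SORTKEY controls
# the physical on-disk order within each node.
ddl = """
CREATE TABLE rides (
    id         BIGINT,
    city_id    INTEGER,
    created_at TIMESTAMP
)
DISTKEY (city_id)      -- co-locate rows of the same city on one node
SORTKEY (created_at);  -- make range scans over time windows cheap
"""
print(ddl)
```

Choosing these keys badly is exactly how a Redshift cluster “becomes sluggish”: a poor distribution key skews the data across nodes, and a missing sort key forces full scans for time-bounded queries.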

The Data Engineering team’s broad objective is to make sure that the data we collect is well-formed, and that accessing it is fast and easy.

Data Quality

So how do we deal with our challenges? Glad you asked. Our data architecture is in continuous evolution, but what we have now relies heavily on a number of key components. First and foremost, we monitor our data. Is it growing faster than expected? Is it skewed, i.e., are we distributing it unevenly among the storage nodes? Are our clusters under a heavy load? Are they underutilized?
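The skew question above reduces to a simple ratio. As a toy sketch (in a real cluster the per-node row counts would come from Redshift’s system views, and the alert threshold is a tuning choice):

```python
def skew_ratio(rows_per_node):
    """Ratio of the most loaded node to the average load.

    1.0 means the rows are spread perfectly evenly; the further above
    1.0, the more one node is doing a disproportionate share of the work.
    """
    if not rows_per_node:
        return 0.0
    average = sum(rows_per_node) / len(rows_per_node)
    return max(rows_per_node) / average if average else 0.0

# One node holding most of a table's rows stands out immediately.
print(skew_ratio([100, 110, 95, 600]))
```

A monitoring job can evaluate this per table and page us when the ratio crosses a threshold, which usually points at a badly chosen distribution key.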

We run a number of checks based on our understanding of all of this. This understanding sometimes crystallizes into automatic checks: once we know enough about what the data should look like, we try to automate as much as possible and streamline the data flow.

Let me give an example. During an inspection, we find out, via a custom query on an important table, that some rows are duplicated. We run the check a couple of times, then decide that we need to take measures. The query itself is of little use on its own, because it is not the same for every table. So we design a process to build a duplicate-checking query for any table. And since the process is repeatable, we bake it into a service that runs automatically and periodically, and is connected to our alerting system, so that when there are duplicates, we can act on them.
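A minimal version of such a query builder might look like this (the table and key columns are placeholders; a production version would also handle quoting and pull the key columns from table metadata):

```python
def duplicate_check_query(table, key_columns):
    """Build a query that reports key values appearing in more than one row.

    Since Redshift does not enforce primary key constraints, grouping by
    the logical key and filtering on COUNT(*) > 1 is one way to surface
    duplicates after a load.
    """
    keys = ", ".join(key_columns)
    return (
        f"SELECT {keys}, COUNT(*) AS occurrences\n"
        f"FROM {table}\n"
        f"GROUP BY {keys}\n"
        f"HAVING COUNT(*) > 1;"
    )

print(duplicate_check_query("rides", ["id"]))
```

The service then only needs a list of (table, key columns) pairs: it generates one query per table, runs them on a schedule, and raises an alert whenever any of them returns rows.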

For this to happen, sharing our understanding of the data is of paramount importance. Progress in our efficiency can only happen through an ongoing discussion that leverages our different backgrounds and perspectives. We encourage this by promoting an open model of communication, based on Slack, impromptu meetings, documents, coffee machine chats, and whatever else we can come up with, to make information flow without becoming overwhelming. We constantly check our own understanding and compare it with the real data.

We also rely heavily on feedback from our users. Sometimes we discover data inconsistencies only because a manager was trying to analyze the data and the numbers didn’t add up.

Our knowledge is not fixed. This is true of everything in the software world, but it is even more true in a discipline with a history as short as DE’s. We constantly revise our architecture, because what seems like a good idea now will most probably become obsolete in a few months. Our understanding deepens after attending conferences, after banging our heads against the wall over a recurring problem, or after a casual conversation in which we ask ourselves: “Why are we doing it this way?” The “What if…?” questions are usually the most useful ones.

We differentiate and specialize to leverage deep skills that include distributed systems, database and query optimization, and systems programming. At the same time, we try to share our learnings so that everyone knows what’s going on and can offer a fresh perspective. The direct contact with the Data Science team gives us a general understanding of the problems they are facing, so that we come up with specific plans to make them more manageable.

To sum up

I have tried to describe the job we do in the Data Engineering team. We massage data, put its pieces together to provide a different perspective, process it, store it, manipulate it, and optimize access to it. We also build pipelines to make the data flow from one part of our architecture to another. We write efficient software, maintain our Redshift clusters, optimize and advise on queries, and constantly re-evaluate our process to make it faster and more accurate. To do this, we keep the discussion open, share our findings, and attend conferences. I have also tried to give examples of the challenges we face every day.

We are in the middle of a huge transformative process. We are piecing together a data science architecture that needs to sustain terabytes of data in a dependable and distributed way, so that the data is at the fingertips of its users. Haven’t I scared you off yet? Then maybe you should consider joining forces with us.

About the author

Gianluca Ciccarelli works as a Data Engineer at Bolt. Together with the rest of his team, he is building the data pipeline that will enable Bolt’s Data Scientists and Data Analysts to make faster and better decisions in order to offer the best possible experience to both drivers and passengers. After a brief stint in networking research, he has worked as a Software Engineer in the domains of banking, travel, and web browsers.