“Data Engineers are specialized software engineers that enable others to answer questions on datasets within latency constraints.” Nathan Marz

Inventor of Apache Storm and the Lambda Architecture

Author of “Big Data” and Insight Advisor

Since 2014, Insight has helped 300+ of the brightest software engineers and academic programmers transition into top Data Engineering roles, and we’ve learned a lot about how to make that transition efficient. Over the last two years, we’ve received many requests from people wanting to know what they can do to improve their data engineering skills and their chances of getting into the Fellows Program, so we wanted to share some of our thoughts here.

At Insight, Fellows learn to think like data engineers and are exposed to many of the open-source distributed tools used in industry by building a scalable data platform. Below you will find suggestions and resources that have helped our Fellows prepare for this transition.

“The number of data engineers more than doubled from 2013–2015. And based on the job posting data from earlier, this growth isn’t about to slow down.” Stitch Data

The State of Data Engineering

Data Engineering Challenges

One of the best ways to begin understanding data engineering is to familiarize yourself with real-world challenges by learning from data engineers in the industry.

Big Data by Nathan Marz

We also highly recommend Big Data, the book by Nathan Marz, creator of Apache Storm and the Lambda Architecture. Our Fellows have found it very helpful, and the first two chapters are available for free online.

Why?:

Marz’s book provides a high-level view of common data engineering challenges, which serves as a great introduction to DE.

Action Item: Read at least the first two chapters of Nathan Marz’s book on Big Data.

Components of a Data Pipeline

Once you’re familiar with the challenges of a data engineer, the next step is to get acquainted with the components of a data pipeline and the technologies used to handle large, real-time data sets:

Data Ingestion

Distributed File Systems

Batch Processing

Real-time Processing

Databases

Web Application Frameworks

During the program, our Fellows take a deep dive into some of these technologies, but it is very helpful to spend time exploring a few of these tools at a high level to understand how they fit into the data engineering ecosystem.
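To make these components concrete, here is a minimal sketch in plain Python of how ingestion, batch processing, and storage fit together. This is only an illustration: in a real pipeline each stage would be a distributed system (for example, a message queue feeding a batch processor feeding a database), and all function and field names below are made up for the example.

```python
from collections import defaultdict

def ingest(raw_lines):
    """Ingestion: parse raw records into structured events."""
    for line in raw_lines:
        user, action = line.strip().split(",")
        yield {"user": user, "action": action}

def batch_process(events):
    """Batch processing: aggregate events per user."""
    counts = defaultdict(int)
    for event in events:
        counts[event["user"]] += 1
    return dict(counts)

def store(aggregates, database):
    """Storage: write results to a database (here, just a dict)."""
    database.update(aggregates)

raw = ["alice,click", "bob,click", "alice,purchase"]
db = {}
store(batch_process(ingest(raw)), db)
print(db)  # {'alice': 2, 'bob': 1}
```

Even at this toy scale, the pipeline shape is the same one you will see throughout the tools listed above: data flows in one direction through ingestion, processing, and storage stages that can each be scaled independently.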

Why?:

As a data engineer, it’s important to understand the major components of a typical data pipeline and to be able to assess the pros and cons of different tools for specific tasks.

Action Items:

Review the major components of a data pipeline as described on the Data Engineering Ecosystem page on our Wiki and in The data engineering ecosystem in 2017 post on our blog.

Getting started with AWS and DevOps

One of the hurdles in learning data engineering is setting up a distributed cluster to develop on. Amazon provides a free tier that you can use to learn distributed technologies, rather than being limited to your local system.

During the session, Insight Fellows commonly use AWS to run more complex applications on distributed clusters.

Why?:

AWS is one of the most widely used cloud platforms, and experience with it is a hugely valuable skill. The more familiar you are with current cloud offerings, the more effective you can be as a data engineer.

Action Items:

Work through the Introductory Development Setups for Spark on our blog and run the Word Count example from our wiki page. Bookmark AWS in Plain English as a cheat sheet for later.
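Before running the Word Count example on a cluster, it can help to see the same computation sketched locally in plain Python. The map and reduce steps below mirror what Spark distributes across machines; this is a local illustration of the idea, not Spark’s actual API.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to strive to seek"]

# "Map": split each line into (word, 1) pairs, as a MapReduce mapper would.
pairs = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# "Reduce": sum the counts per word, as a reducer would after the shuffle.
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts.most_common(2))  # [('to', 4), ('be', 2)]
```

The distributed version partitions the lines across machines, runs the map step in parallel, and shuffles pairs with the same word to the same machine before reducing. Understanding that flow locally makes the cluster version much less mysterious.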

If you’re unfamiliar with networking, read through DigitalOcean’s introduction to networking terminology and protocols. DigitalOcean has some of the best tutorials on networking and operations — they’re short and digestible.

If you want to learn more about networking on AWS, take a look at J Cole Morrison’s analogy about Virtual Private Cloud (VPC) as a city.

Computer Science Fundamentals

While it is good to know the basics of some popular technologies, it is better to know the underlying computer science fundamentals behind them. For example, it is more valuable to know that many modern NoSQL databases are implementations of a distributed hash table, and to understand why that design is used.

Tip: Do not underestimate the importance of computer science fundamentals; their concepts are reused and built upon in distributed systems.
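As a concrete example of fundamentals resurfacing at scale: the hash table from your algorithms course becomes a partitioning scheme when keys are hashed to nodes instead of buckets. The sketch below is purely illustrative (the node names are made up, and this is not any particular database’s algorithm):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for_key(key, nodes=NODES):
    """Route a key to a node by hashing it, just as a hash
    table routes a key to a bucket."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Every client computes the same mapping, so no central lookup is needed.
for key in ["user:42", "user:43", "user:44"]:
    print(key, "->", node_for_key(key))
```

Real systems typically refine this with consistent hashing, so that adding or removing a node remaps only a small fraction of keys rather than nearly all of them, which is exactly the kind of trade-off interviews and the fundamentals courses below prepare you to reason about.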

Why?:

We have found that our most successful Fellows are able to prove to employers they have a solid understanding of computer science and database fundamentals.

Action Items:

Data engineers should have a very good understanding of data structures and algorithms. You can review these by watching the lecture videos from the first three courses of Tim Roughgarden’s Algorithms Specialization. Other resources to consider are Problem Solving with Algorithms and Data Structures in Python and the videos from the Data Structures and Algorithms Specialization. At the heart of every distributed system are core data structures and algorithms optimized for performance and robustness at scale. The better you understand these fundamentals, the better you’ll be able to apply them to distributed systems, and to impress prospective employers along the way.

As part of the interview prep section of the program, you’ll work on solving coding challenges similar to what employers have posed to job candidates. Get a head start by going to leetcode.com and solving the problems listed there.
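Many of these coding challenges reduce to choosing the right data structure. For example, finding the k most frequent items in a stream is a classic that combines a hash map with a heap; the sketch below shows one common approach (the function name and example data are ours, not from any specific problem set):

```python
import heapq
from collections import Counter

def top_k(items, k):
    """Return the k most frequent items using a hash map plus a heap.

    Counting is O(n); heapq.nlargest keeps only k entries in its heap,
    giving O(n log k) overall instead of O(n log n) for a full sort.
    """
    counts = Counter(items)
    return heapq.nlargest(k, counts, key=counts.get)

print(top_k(["a", "b", "a", "c", "a", "b"], 2))  # ['a', 'b']
```

The same pattern, counting with a hash map and selecting with a heap, scales up directly: it is essentially what a "top trending items" job does on a cluster, just partitioned across machines.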

Coding Fundamentals and Best Practices

As a data engineer, you will most likely be part of a team of software engineers and data scientists working on shared projects and code bases. Using software development best practices will help you become an efficient team member.

Why?:

Navigating Unix and writing clean, well-documented code with version control will be important as you work with others and communicate with the rest of the team. Prospective employers may also evaluate you on your code (e.g., by perusing your GitHub repos).

Action Items:

Go through this GitHub tutorial if you are not familiar with version control.

Make sure you are comfortable with the Linux terminal commands in this sheet.

Look up best practices of your chosen language, e.g. Google’s Python Class.

Learn a New Language

Many of you are familiar with Python, Java, or perhaps C++, but you’ll find that learning a new language may be to your advantage. Data engineering tools such as Hadoop, Spark, and many graph libraries are written in Java and Scala. Some tools have Python wrappers, so it’s good to be familiar with Python, but the newest updates often land in the native language first.

Why?:

A working understanding of Java and Scala may give you a leg up with those tools, and in some cases they are the only option, such as when you want to use Spark’s GraphX library.

Action Items:

Learn Java and Scala by following the examples in the posts Java from Python, Scala from Java, and Scala from Python, depending on your background.

Read this post about the various programming paradigms, focusing especially on functional and object-oriented programming. Also check out this excellent post on functional programming’s importance to data engineering from Maxime Beauchemin, the creator of Airflow and Superset.

Bonus Item: Get a sense of Scala with some of the problems in 99 Scala Problems, or re-write one of your favorite algorithms in Scala. You may also enjoy Twitter’s Scala School as a reference.
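The functional ideas Beauchemin highlights, pure functions and immutable data, can be practiced in Python before you move to Scala. Here is a small sketch in that style (the record fields and function names are invented for illustration):

```python
from functools import reduce

def clean(record):
    """Pure function: same input always yields the same output,
    and the input record is never mutated."""
    return {**record, "email": record["email"].strip().lower()}

def total_spend(records):
    """Fold (reduce) over the records instead of mutating an
    accumulator variable in a loop."""
    return reduce(lambda acc, r: acc + r["spend"], records, 0)

records = [
    {"email": " Alice@Example.com ", "spend": 10},
    {"email": "bob@example.com", "spend": 5},
]

cleaned = [clean(r) for r in records]   # map step; originals untouched
print(cleaned[0]["email"])              # alice@example.com
print(total_spend(cleaned))             # 15
```

Because `clean` is pure, re-running a failed pipeline task produces the same result every time, which is exactly the idempotency property Beauchemin argues is central to reliable data pipelines.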

Extra materials

If you’ve familiarized yourself with all the materials above, here are some additional useful topics to continue with.

Action Items:

Familiarity with SQL and semi-structured data formats (XML and JSON) is crucial for any data engineer. You can review these topics by going through the Mode Analytics or SQLZOO tutorials. Also do the exercises in Stanford’s self-paced Database Course, or at the very least the SQL mini-course.
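You can practice both SQL and semi-structured data locally with Python’s built-in sqlite3 and json modules, with no database server to install. A minimal, self-contained sketch (the table and fields are made up for illustration):

```python
import json
import sqlite3

# Semi-structured input, as it might arrive from an API.
raw = '[{"name": "alice", "city": "NYC"}, {"name": "bob", "city": "SF"}]'
users = json.loads(raw)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [(u["name"], u["city"]) for u in users],
)

# A SQL aggregate over the loaded JSON records.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM users GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('NYC', 1), ('SF', 1)]
```

This parse-then-load-then-query loop, JSON in, relational table out, is one of the most common day-to-day tasks in data engineering, so it is worth being fluent at it even at toy scale.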

Further Reading

Here are some of the primary news sources read by people in tech. We recommend skimming these resources every few days: