Industry demand for Data Engineers is constantly rising, and with it the number of software engineers and recent graduates trying to enter the field. Data Engineering is a discipline notorious for being framework-driven, and newcomers often struggle to identify which frameworks are worth learning.

We at Insight offer a 7-week, tuition-free Fellowship to help programmers transition to Data Engineering, and we have helped hundreds of Fellows overcome this exact hurdle. As part of their fellowship training and transition into data engineering, Fellows spend three weeks working on data engineering projects where they dive deep into these frameworks.

Here are the six most important and useful ones our Fellows work on during their projects.

Spark

Spark is one of the most popular tools in distributed computing and can be used for both batch and streaming applications. Spark’s rich ecosystem and advanced APIs and libraries, such as Spark SQL and MLlib, make it one of the most powerful and flexible tools available. Our Fellows have used Spark in a wide variety of applications, from building a platform to discover influencers on GitHub to a real-time app for finding parking spots in Seattle!

If you want to get started with Spark, check out this blog post on how to set up your very own Spark cluster on AWS.

Flink

An alternative to Spark, Flink has gained a lot of traction in the Data Engineering community. While its ecosystem is not as rich, Flink shines with its different approach to unified stream/batch computation. For instance, Fellows have used Flink to build a real-time fraud detection pipeline where the focus was on low latency.

Kafka

Kafka started out as a fault-tolerant, distributed messaging and real-time data ingestion platform. However, it has evolved into a complete streaming platform capable of real-time analytics and high-throughput processing of data. Our Fellows love Kafka for its performance and ease of use — and have used it to build an ingestion platform for collecting autonomous-driving data and a real-time app suggesting tags for your next StackOverflow post.

ElasticSearch

ElasticSearch is a popular distributed search engine built on top of Apache Lucene. ElasticSearch is also part of the so-called ELK Stack, consisting of ElasticSearch, Logstash and Kibana. This stack is widely used to build highly scalable logging pipelines for maintaining web applications. Our Fellows have used ElasticSearch in a variety of projects and have even developed new plugins — check out this project where a Fellow added the K-nearest neighbor algorithm to ElasticSearch!

PostgreSQL/Redshift

PostgreSQL is a popular open-source relational database. While NoSQL databases emerged with the advent of big data, relational databases remain widely popular and are still the best solution for many use cases. Our Fellows gravitate to PostgreSQL not just for its ease of use, but also for its PostGIS extension, which unlocks powerful geospatial queries. Our Fellows have used this to build platforms such as AirAware, an app that monitors air quality.

An accessible introduction to setting up your own PostgreSQL database can be found here.

Redshift is an analytical database/data warehousing solution from AWS. It was originally based on Postgres (which is why we grouped these tools together), but has been greatly expanded and modified with a focus on performant analytical queries and advanced data warehouse features. Our Fellows have used it in their projects, often in conjunction with Spark, for the exploration of Reddit data.

Airflow

Airflow is one of the most popular workflow automation and scheduling systems. It models all jobs as directed acyclic graphs (DAGs), which makes it accessible to new users while still supporting complex workloads. Our Fellows have used Airflow to build pipelines providing financial analytics for crypto assets, and have hosted a fully fault-tolerant Airflow cluster.

If you want to get started with Airflow, we’ve got you covered with our Airflow 101 article.