Since we launched the Insight Data Engineering Fellows program in 2014, we’ve built relationships with over 75 teams in the data industry. We’ve discussed the latest challenges faced by engineers on top teams like Facebook, Airbnb, Slack, The New York Times, LinkedIn, Amazon, and Tesla. Additionally, our ever-growing alumni network now includes 150+ engineers and 750+ data scientists who regularly share their experience with the Insight community. Thanks to this strong community, we are uniquely positioned to spot emerging patterns in the technologies used in the field.

We’re constantly exploring ways to share this knowledge with the next generation of data engineers and the broader data community. To that end, we’ve developed a more interactive version of our Data Engineering Ecosystem Map. This iteration provides a streamlined view of the core components of data pipelines, while enabling deeper exploration of the complex world of distributed systems technologies.

Trends in data engineering

In updating this map, we’ve reflected on recent changes in the tools and services available to today’s data teams. Below, we highlight some of the noteworthy trends.

Convergence in technologies: Kafka and Spark

Despite the overwhelming number of tools that continue to be introduced into the data engineering space, there appear to be two notable points of convergence.

Of the numerous available queuing technologies, Kafka stands apart as the most widely adopted.

Since LinkedIn released its log-based solution to the open source community in 2011, Kafka has been steadily rising in popularity and has now become the default ingestion tool for streaming data.

Beyond streaming data, Kafka is increasingly being used as a centralized message bus for microservices at a large number of companies. With impressively high throughput, strong reliability, and integrations with many other popular technologies, the reasons for its widespread adoption are clear.
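The log abstraction at the heart of Kafka is worth making concrete. Below is a toy, in-memory sketch of a single topic partition (the `CommitLog` name and methods are our own, and this omits partitioning, persistence, replication, and consumer groups). The key idea it illustrates: producers append to an ordered, immutable log, and each consumer tracks its own offset, so many independent consumers can read the same records at their own pace.

```python
from collections import defaultdict

class CommitLog:
    """A toy, in-memory append-only log in the spirit of one Kafka
    topic partition. Producers append; each consumer keeps its own
    read offset, so consumers are decoupled from each other."""

    def __init__(self):
        self._records = []                 # the ordered, immutable log
        self._offsets = defaultdict(int)   # per-consumer read positions

    def append(self, record):
        """Producer side: append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Consumer side: read from this consumer's offset onward,
        then advance the offset (a simplified 'commit')."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

log = CommitLog()
for event in ("signup", "click", "purchase"):
    log.append(event)

# Two consumers read the same log independently, at different paces.
print(log.poll("analytics"))   # ['signup', 'click', 'purchase']
print(log.poll("billing", 2))  # ['signup', 'click']
print(log.poll("billing", 2))  # ['purchase']
```

Because records are retained rather than deleted on read, a new consumer (or one recovering from a crash) can simply replay the log from an earlier offset, which is also what makes Kafka useful as a system of record and not just a queue.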

The other technology that has garnered widespread adoption is Apache Spark, the general-purpose, distributed processing framework.

While many capable frameworks have emerged since Hadoop’s early monopoly on “Big Data”, Spark has cemented its position as the “default” tool for processing data at scale.

Spark has proven itself an all-around workhorse, handling everything from traditional batch processing jobs to serving online machine learning models. Its high-level, structured APIs like DataFrames and Spark SQL, along with its streaming and graph libraries, allow it to address many use cases from one accessible codebase. As with Kafka, Spark enjoys strong community support, and many new and existing projects are integrating with it.

While Kafka and Spark are popular choices, they certainly don’t fit every use case. It’s important to investigate each tool’s pros, cons, and alternatives. As we often stress at Insight, make sure you pick the right tool for the job!

Trends in architecture: Unified with Kappa

Beyond trends in specific technologies, we’ve noticed many teams progressing toward the idealized Kappa architecture. In contrast to the Lambda approach, many tools now embody the position that batch processing problems are simply a subset of stream processing problems.
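One way to make the Kappa idea concrete is a pure-Python sketch (the function name and data are illustrative, not from any particular framework): a single streaming computation folds events into state, and a “batch” job is nothing more than replaying a bounded slice of the historical log through that same code.

```python
def running_totals(events, state=None):
    """A single streaming computation: fold (key, amount) events into
    per-key totals. Under a Kappa-style design, batch processing is just
    this same function applied to a bounded (replayed) event stream."""
    state = dict(state or {})
    for key, amount in events:
        state[key] = state.get(key, 0) + amount
    return state

# "Batch": replay the full historical log through the stream processor.
history = [("alice", 5), ("bob", 3), ("alice", 2)]
state = running_totals(history)
print(state)  # {'alice': 7, 'bob': 3}

# "Streaming": keep feeding new events into the same function.
state = running_totals([("bob", 4)], state)
print(state)  # {'alice': 7, 'bob': 7}
```

The appeal of this approach is that there is only one codebase to maintain and reason about, instead of the separate batch and speed layers that Lambda architectures require.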

While not quite yet at the forefront, technologies like Flink, Apex, and Gearpump are pushing toward the vision of a unified batch and stream processing framework. Even Spark, with its release of Structured Streaming, now offers a single interface to operate on both batch and streaming data.

In some sense, the Apache Beam project represents a culmination of these efforts. Based on Google’s Dataflow Model, Beam aims to create a single unified API, allowing developers to write applications that are agnostic of the processing engine beneath them.
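The core idea of “define the pipeline once, execute it anywhere” can be sketched in a few lines of plain Python. To be clear, this is not the Beam API; `Pipeline` and `direct_runner` are hypothetical names we’re using to show the separation between a pipeline’s definition and the runner that executes it.

```python
class Pipeline:
    """A toy 'pipeline as data' abstraction: the chain of transforms is
    recorded, not executed, so any runner can interpret it later."""

    def __init__(self):
        self.steps = []

    def apply(self, fn):
        self.steps.append(fn)
        return self

def direct_runner(pipeline, data):
    """One possible runner: execute the recorded steps locally, in
    order. A distributed runner could take the same pipeline object
    and ship its steps to a cluster instead."""
    for step in pipeline.steps:
        data = step(data)
    return data

p = Pipeline()
p.apply(lambda xs: [x.lower() for x in xs])
p.apply(lambda xs: [x for x in xs if x.startswith("s")])

print(direct_runner(p, ["Spark", "Kafka", "Storm"]))  # ['spark', 'storm']
```

In Beam, the analogous split is between the pipeline you author with its SDK and the runner (Dataflow, Flink, Spark, and so on) chosen at execution time.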

With the emergence of unified processing frameworks and projects such as Apache Beam, the Kappa architecture may see rapid adoption. Regardless of the architecture, as processing frameworks continue to improve and evolve, we expect the line between batch and stream processing to continue to blur.

Managed Services on the rise

On a slightly more contentious note, ‘serverless’ offerings are also a developing trend. There is a growing desire among data teams, like the one at The New York Times, to architect pipelines without the effort of managing any underlying infrastructure. While production use cases for these services have been relatively limited, the features they offer are continuing to improve. With services like AWS’s S3, Redshift, Athena, EMR, Kinesis, and Lambda, as well as GCP’s BigQuery, Pub/Sub, and Dataproc, the major cloud providers are clearly investing in these full-service solutions.

Similar to the transition from “on-prem” servers to cloud infrastructure, it’s likely that data teams will increasingly leverage managed data services. In the meantime, hybrid architectures that are partially self-managed and partially provider-managed will become increasingly common.

Trends in cloud providers: AWS vs GCP

Another notable change over the past few years has been the increase in competition faced by Amazon Web Services (AWS). While platforms like Microsoft Azure, IBM, DigitalOcean, and Rackspace have been around for a while, it seemed like no one could challenge the first-mover advantage AWS has held since its launch in 2006.

However, Google had been developing its own sophisticated infrastructure for internal use all along. Indeed, Google is known for pioneering distributed systems internally but publishing white papers rather than open-sourcing the code. With serious investment in Google Cloud Platform (GCP), it now offers “Google Infrastructure For Everyone Else” (GIFEE) as managed services.

Over the past few years, GCP has made some impressive strides and is quickly becoming a serious contender. While GCP doesn’t yet match AWS’s full array of services, more and more top teams, like Spotify, are making the switch. Perhaps the field of cloud providers will eventually pare down, but we expect healthy competition in the near future.

Looking forward

While no one knows what the future holds for the field of data, one thing is clear: new technologies will empower us to get more out of our data. Whether through new technologies and services or new features in existing ones, developers will have an ever-richer set of tools for building data pipelines and platforms.

It will continue to be an exciting time to be a data engineer.