What’s your definition of a data pipeline?

(JG)

Data Pipeline – An arbitrarily complex chain of processes that manipulate data, where the output data of one process becomes the input to the next.

IMHO ETL is just one of many types of data pipelines — but that also depends on how you define ETL 😉
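That chain-of-processes definition can be sketched with Python generators, where each stage consumes the previous stage's output (the stage names here are illustrative, not from any particular framework):

```python
# A minimal sketch of a chained data pipeline built from Python
# generators. Each stage's output becomes the next stage's input.

def extract(records):
    # Source stage: yield raw records.
    for r in records:
        yield r

def transform(stream):
    # Middle stage: a trivial transformation (uppercase each record).
    for r in stream:
        yield r.upper()

def load(stream):
    # Sink stage: collect results (stands in for writing to storage).
    return list(stream)

raw = ["a", "b", "c"]
result = load(transform(extract(raw)))
print(result)  # ['A', 'B', 'C']
```

Because generators are lazy, records flow through the chain one at a time rather than being materialized between stages.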

(DW)

This term is overloaded. For example, the Spark project uses it very specifically for ML pipelines, although some of the characteristics are similar.

I consider a pipeline to have these characteristics:

- 1 or more data inputs
- 1 or more data outputs
- optional filtering
- optional transformation, including schema changes (adding or removing fields) and transforming the format
- optional aggregation, including group bys, joins, and statistics
- robustness features:
  - resiliency against failure: when any part of the pipeline fails, automated recovery attempts to repair the issue
  - when an interrupted pipeline resumes normal operation, it tries to pick up where it left off, subject to these requirements:
    - if at-least-once delivery is required, the pipeline ensures that processing of each record happens at least once, involving some sort of acknowledgement
    - if at-most-once delivery is required, the pipeline can start after the last record that it read at the beginning of the pipeline
    - if exactly-once (effectively-once?) delivery is required, the pipeline combines at-least-once delivery with deduplication mechanisms so that each result is output once and only once (subject to the fact that it’s impossible to make this guarantee for all possible scenarios)
  - management and monitoring hooks expose issues, as well as normal operational characteristics like performance metrics
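The effectively-once case can be sketched in Python: at-least-once delivery may redeliver a record, and a deduplication check keeps the output to once and only once. The record IDs and the in-memory seen-set below are illustrative; a real pipeline would persist that state so it survives restarts.

```python
# Sketch: "effectively once" output built from at-least-once delivery
# plus deduplication on a record ID.

def process_effectively_once(records, sink, seen=None):
    """records is an iterable of (record_id, payload) pairs, possibly
    containing redeliveries; sink stands in for the output system."""
    seen = set() if seen is None else seen
    for record_id, payload in records:
        if record_id in seen:
            continue  # duplicate redelivery: skip it
        sink.append(payload)
        seen.add(record_id)  # acknowledge only after the output succeeds
    return seen

out = []
# Record id 1 is delivered twice, as at-least-once delivery allows.
process_effectively_once([(1, "a"), (2, "b"), (1, "a")], out)
print(out)  # ['a', 'b']
```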

I wouldn’t necessarily add latency criteria to the basic definition. Sometimes a pipeline is “watch this directory and process each file that shows up.”

I think real ETL jobs are pipelines, because they must satisfy these criteria. Depending on how broadly you define ETL, all pipelines could be ETL jobs.

(MG)

ETL process, in my opinion, carries the baggage of the old-school relational ETL tools. It was and is simply a process that picks up data from one system, transforms it, and loads it elsewhere. When I hear the term ETL process, two things come to mind – “batch” and “usually periodic”.

When I hear the term data pipeline, I think of something much broader – something that takes data from one system to another, potentially including transformation along the way. This covers both newer streaming-style processing and older ETL processes. So, to me, data pipeline is a more generic, encompassing term that includes real-time transformation. One point I would note is that data pipelines don’t have to have a transform. A replication system (like LinkedIn’s Gobblin) still sets up data pipelines. So, while an ETL process almost always has a transformation focus, data pipelines don’t need to have transformations.

(RW)

I’d define data pipeline more broadly than ETL. An ETL process is a data pipeline, but so is:

automation of ML training (ex: pull data from warehouse, feed to ML engine as a training set, update results in a production database that’s being used for real-time recommendations)

data quality pipeline (ex: run a query on the ML-generated values above, confirm they’re within a range of reasonableness, alert if not)

ingestion from external sources (ex: fetching data from the Salesforce API and dropping it into the warehouse, ELT-style)

metric computation: roll ups of engagement/segmentation metrics

sessionization: re-ordering events to tell clearer user behavior stories

data pipelines can also be real-time (ex: a Kafka consumer pulls data from Kafka and algorithm coefficients from Redis, runs an ML algorithm, and presents a recommendation to the user in real time)

(EC)

What is a pipe? It takes an input (water from the utility), and gives you output (water in your house).

What is a pipeline? Like an oil pipeline? Same thing, but possibly transforms the input to outputs.

So for me, a data pipeline can be thought of as a function which, given some data as input, transforms it and returns the transformed data as output.
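That pipeline-as-a-function view can be sketched as plain function composition. The `compose` helper and the three stage functions below are illustrative, not from any particular library:

```python
from functools import reduce

def compose(*fns):
    # Chain functions left to right: the first stage runs first,
    # and each stage's output feeds the next.
    return reduce(lambda f, g: lambda x: g(f(x)), fns)

pipeline = compose(
    lambda xs: [x * 2 for x in xs],       # transform: double each value
    lambda xs: [x for x in xs if x > 2],  # filter: keep values above 2
    sum,                                  # aggregate: total the remainder
)
print(pipeline([1, 2, 3]))  # [2, 4, 6] -> [4, 6] -> 10
```

Each stage is itself just a function of input data to output data, so composing them yields another pipeline.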

(DB)

I don’t like Data Pipeline. I like Data Processing Pipeline.

Data Processing Pipeline is a collection of instructions to read, transform, or write data that is designed to be executed by a data processing engine. Typically used by the Big Data community, the pipeline captures arbitrary processing logic as a directed acyclic graph (DAG) of transformations that enables parallel execution on a distributed system.
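A minimal sketch of such a DAG of transformations, using Python’s stdlib `graphlib` for the topological ordering. The node names and transforms are invented for illustration; real engines like Spark also plan parallel execution of independent branches, which this sequential sketch omits:

```python
from graphlib import TopologicalSorter

# Each node is a transformation; edges are data dependencies.
transforms = {
    "read":   lambda results: [1, 2, 3],
    "double": lambda results: [x * 2 for x in results["read"]],
    "square": lambda results: [x * x for x in results["read"]],
    "write":  lambda results: results["double"] + results["square"],
}
# Map each node to the set of nodes it depends on.
deps = {"double": {"read"}, "square": {"read"}, "write": {"double", "square"}}

results = {}
for node in TopologicalSorter(deps).static_order():
    # A node runs only after all of its dependencies have produced output.
    results[node] = transforms[node](results)
print(results["write"])  # [2, 4, 6, 1, 4, 9]
```

Because "double" and "square" depend only on "read" and not on each other, an engine is free to run them in parallel.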

(JA)

A process that takes raw data and transforms it in a way that is usable by the entire organization. This data is served up by technologies that are the right tools for the job and that are correct for the use cases. The data itself is available in formats that keep in mind the changing nature of data and the enterprise demand for data.

(TA)

I view data pipelines relatively broadly as being any software system that takes data from one or more inputs and transforms it in some way before writing it to one or more outputs. This can mean something like a Hadoop, Spark, Flink, Beam, etc. pipeline reading from and writing to some type of distributed data store, but it can also mean a set of serverless functions operating on HTTP calls, some hand-rolled job reading and writing Kafka, or even your frontend WebUI servers taking user input and turning it into database writes. The reason I lump them all under the same umbrella is that they’re all doing the same task at the core: reading input data, transforming it in some way, and writing it as output data. And as such, they all struggle in similar (though sometimes different) ways with difficulties like duplicate detection and output consistency. And they all present similar challenges in understanding things like progress, latency, completeness, and correctness.

(SO)

I think of data pipelines as a way to move data from point A to point B (C, D, E…), with or without transformations, in real time or in batch. The pipeline has a guarantee of not losing data (I don’t like leaky pipes), and you have full traceability of the pipeline and its performance (you’ve got to know when it’s slowing down, leaking, or has stopped working).