Pipeline

Pipeline: A well-oiled big data pipeline is a must for the success of machine learning.

The value of data is unlocked only after it is transformed into actionable insight, and when that insight is promptly delivered.

A data pipeline stitches together the end-to-end operation: collecting the data, transforming it into insights, training a model, delivering the insights, and applying the model whenever and wherever action needs to be taken to achieve the business goal.

Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.

— Clive Humby, UK Mathematician and architect of Tesco’s Clubcard

A data pipeline has five stages grouped into three heads:

Data Engineering: collection, ingestion, preparation (~50% effort)

Analytics / Machine Learning: computation (~25% effort)

Delivery: presentation (~25% effort)
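As a rough illustration of the grouping above, the stages and their approximate effort shares can be written down as data and used to split a project budget. The names `STAGES`, `EFFORT_SHARE`, and `split_effort` are hypothetical, and the percentages are the approximate figures from the list, not measurements.

```python
# Stage -> group mapping, in pipeline order (taken from the list above).
STAGES = {
    "collection": "Data Engineering",
    "ingestion": "Data Engineering",
    "preparation": "Data Engineering",
    "computation": "Analytics / Machine Learning",
    "presentation": "Delivery",
}

# Approximate share of total effort per group, as fractions of 1.
EFFORT_SHARE = {
    "Data Engineering": 0.50,
    "Analytics / Machine Learning": 0.25,
    "Delivery": 0.25,
}


def split_effort(total_days):
    """Split a total effort budget across the three groups."""
    return {group: total_days * share for group, share in EFFORT_SHARE.items()}


def stages_in(group):
    """Return the stage names that belong to a group, in pipeline order."""
    return [stage for stage, g in STAGES.items() if g == group]
```

For example, `split_effort(40)` assigns 20 days to Data Engineering and 10 days each to the other two groups.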

Collection: Data sources (mobile apps, websites, web apps, microservices, IoT devices, etc.) are instrumented to collect relevant data.
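A minimal sketch of what instrumenting a source might look like: each interaction is wrapped in a small, self-describing event before it leaves the source. The field names (`event_id`, `source`, `name`, `timestamp`, `properties`) and the functions `make_event` and `serialize` are assumptions chosen for illustration, not a standard schema.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(source, name, properties=None, now=None):
    """Build one instrumentation event as a plain dict.

    `now` can be injected to make the timestamp deterministic in tests.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "event_id": str(uuid.uuid4()),       # unique id, useful for deduplication
        "source": source,                    # e.g. "web_app", "mobile_app"
        "name": name,                        # e.g. "page_view"
        "timestamp": now.isoformat(),        # ISO 8601, UTC
        "properties": dict(properties or {}),
    }


def serialize(event):
    """Encode an event as UTF-8 JSON bytes, ready to send to an inlet point."""
    return json.dumps(event, sort_keys=True).encode("utf-8")
```

A web app could call `serialize(make_event("web_app", "page_view", {"path": "/pricing"}))` on each page load and hand the bytes to whatever transport the ingestion stage exposes.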

Ingestion: The instrumented sources pump the data into various inlet points (HTTP, MQTT, message queues, etc.). Jobs can also import data from services such as Google Analytics. The data arrives in two forms, blobs and streams, and all of it is collected in a data lake.
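The difference between the two forms can be sketched with a toy data lake backed by a local directory: a blob is stored whole as one file, while stream records are appended one at a time. `LocalDataLake` and its methods are hypothetical names, and a real lake would normally sit on object storage rather than a local disk.

```python
import json
import tempfile
from pathlib import Path


class LocalDataLake:
    """Toy data lake: a directory tree holding raw, untransformed data."""

    def __init__(self, root):
        self.root = Path(root)

    def put_blob(self, source, name, payload):
        """Store a whole object (e.g. an imported export file) as a single file."""
        path = self.root / "blobs" / source / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
        return path

    def append_stream(self, source, record):
        """Append one stream record as a line of JSON (JSON Lines)."""
        path = self.root / "streams" / f"{source}.jsonl"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return path

    def read_stream(self, source):
        """Read back every record appended for a source, in arrival order."""
        path = self.root / "streams" / f"{source}.jsonl"
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f]


# Usage: both forms land in the same lake, stored as received.
with tempfile.TemporaryDirectory() as tmp:
    lake = LocalDataLake(tmp)
    blob_path = lake.put_blob("analytics_export", "export.csv", b"user,clicks\nu1,3\n")
    lake.append_stream("mobile_app", {"user": "u1", "event": "open"})
    lake.append_stream("mobile_app", {"user": "u2", "event": "open"})
    stream_records = lake.read_stream("mobile_app")
    blob_bytes = blob_path.read_bytes()
```

Nothing is cleaned or reshaped at this point; the lake keeps the data in its raw form for the preparation stage.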

Preparation: This is the extract, transform, load (ETL) operation that cleanses, conforms, shapes, transforms, and catalogs the data blobs and streams in the data lake, making the data ready to consume for ML and storing it in a data warehouse.
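A minimal ETL sketch, assuming purchase records with `user`, `amount`, and `country` fields (hypothetical) and using an in-memory SQLite database as a stand-in for the data warehouse. Records that cannot be cleansed are dropped here; a production pipeline would more likely route them to a quarantine table.

```python
import sqlite3

# Extract: raw records as they might sit in the data lake.
RAW_EVENTS = [
    {"user": " U1 ", "amount": "19.90", "country": "gb"},
    {"user": "u2", "amount": "n/a", "country": "US"},   # unparseable amount
    {"user": "", "amount": "5", "country": "US"},       # missing user
    {"user": "u3", "amount": "7.5", "country": "us"},
]


def transform(record):
    """Cleanse and conform one record; return None if it cannot be repaired."""
    user = (record.get("user") or "").strip().lower()
    if not user:
        return None
    try:
        amount = float(record["amount"])
    except (KeyError, TypeError, ValueError):
        return None
    country = (record.get("country") or "").strip().upper()
    return (user, amount, country)


def load(rows, conn):
    """Load conformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user TEXT, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(raw, conn):
    """Run transform and load over raw records; return the number of rows loaded."""
    rows = [row for row in map(transform, raw) if row is not None]
    load(rows, conn)
    return len(rows)


warehouse = sqlite3.connect(":memory:")
loaded = run_etl(RAW_EVENTS, warehouse)
```

Of the four raw records, two survive cleansing and are loaded with trimmed, lower-cased user ids and upper-cased country codes.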

Computation: This is where analytics, data science and machine learning happen. Computation can be a combination of batch and stream processing. Models and insights (both structured data and streams) are stored back in the Data Warehouse.
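To illustrate how batch and stream processing can compute the same quantity, the sketch below calculates a mean both ways: once over the full dataset, and once incrementally as each value arrives. `batch_mean` and `RunningMean` are hypothetical names, and the mean stands in for whatever metric or model update the pipeline actually needs.

```python
def batch_mean(values):
    """Batch processing: the whole dataset is available up front."""
    return sum(values) / len(values)


class RunningMean:
    """Stream processing: update the result one record at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean: shift the old mean toward x by 1/count of the gap.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


values = [2.0, 4.0, 6.0, 8.0]

stream = RunningMean()
running = [stream.update(v) for v in values]  # result after each arrival
```

The streaming version never holds more than one value in memory, yet after the last record its result matches the batch result.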

Presentation: The insights are delivered through dashboards, emails, SMSs, push notifications, and microservices. The ML model inferences are exposed as microservices.
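A minimal sketch of exposing model inference as a microservice, using only the Python standard library. The linear scoring function with `WEIGHTS` and `BIAS` is a hypothetical stand-in for a trained model, and the `/predict` path and JSON shape are assumptions. A production service would more likely use a web framework and load the model from storage.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model.
WEIGHTS = {"visits": 0.5, "purchases": 2.0}
BIAS = 1.0


def predict(features):
    """Score one feature dict; missing features default to 0."""
    return BIAS + sum(w * float(features.get(name, 0.0)) for name, w in WEIGHTS.items())


def handle_predict(body):
    """Turn a raw JSON request body into (HTTP status, response payload)."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object")
        score = predict(features)
    except (ValueError, TypeError) as exc:
        return 400, {"error": str(exc)}
    return 200, {"score": score}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the service, after which a client can POST a JSON object such as `{"visits": 4, "purchases": 1}` to `/predict`. Keeping `handle_predict` separate from the HTTP handler lets the inference logic be tested without opening a socket.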