In my previous post I talked about how to architect your data warehouse, especially if you’re using Amazon Redshift.

To build that architecture in practice, we need a tool to Extract, Transform and Load (ETL) our data through the layers. This could be done with a traditional ETL tool such as Informatica, Pentaho or Talend, among many others.

These tools are great but you may find that Amazon’s Data Pipeline tool can also do the trick and simplify your workflow.

In this post I’ll outline some of the basics of Data Pipeline and its pros and cons versus other ETL tools on the market.

What is Data Pipeline?

Data Pipeline is an ETL tool offered in the AWS suite. It has a web-based graphical interface that allows you to create pipelines from a number of different building blocks.

These building blocks represent physical nodes (servers, databases, S3 buckets, etc.) and activities (shell commands, SQL scripts, MapReduce jobs, etc.).

You join the blocks together to form a pipeline for your data to flow through. It even comes with some out-of-the-box workflows to bring data into Redshift from S3, import or export data to and from DynamoDB using S3, and even run a job on an Elastic MapReduce (EMR) cluster.
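To make the building-block idea concrete, here’s a minimal sketch using Python and boto3 (the low-level API behind the console). It defines two blocks, an Ec2Resource node and a ShellCommandActivity that runs on it; the pipeline name, region, S3 log path and IAM role names are placeholder assumptions you’d swap for your own.

```python
import boto3

# Minimal sketch: the region, names, log path and roles below are placeholders.
client = boto3.client("datapipeline", region_name="us-east-1")

# Register an empty pipeline shell; uniqueId guards against double-creation.
pipeline_id = client.create_pipeline(
    name="demo-pipeline", uniqueId="demo-pipeline-001"
)["pipelineId"]

def fields(**kwargs):
    """Helper: turn keyword args into Data Pipeline's key/stringValue format."""
    return [{"key": k, "stringValue": v} for k, v in kwargs.items()]

pipeline_objects = [
    # Pipeline-wide defaults: run on demand, log to S3, use the default roles.
    {"id": "Default", "name": "Default", "fields": fields(
        scheduleType="ondemand",
        pipelineLogUri="s3://my-bucket/logs/",  # placeholder bucket
        role="DataPipelineDefaultRole",
        resourceRole="DataPipelineDefaultResourceRole",
    )},
    # A node block: an EC2 instance created for the run, then torn down.
    {"id": "MyEc2", "name": "MyEc2", "fields": fields(
        type="Ec2Resource", instanceType="t1.micro", terminateAfter="30 Minutes",
    )},
    # An activity block: a shell command, wired to the node via runsOn.
    {"id": "SayHello", "name": "SayHello", "fields": fields(
        type="ShellCommandActivity", command="echo hello from Data Pipeline",
    ) + [{"key": "runsOn", "refValue": "MyEc2"}]},
]

client.put_pipeline_definition(pipelineId=pipeline_id,
                               pipelineObjects=pipeline_objects)
client.activate_pipeline(pipelineId=pipeline_id)
```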

It can do some pretty complex stuff too as demonstrated by Swipely’s workflow in the image below.

Data Pipeline vs the market

Infrastructure

As with any other ETL tool, you need some infrastructure in order to run your pipelines.

Where Data Pipeline stands out, though, is in its ability to spin up an EC2 instance, or even an EMR cluster, on the fly to execute the tasks in your pipeline.

This keeps costs down, as the infrastructure is only in use for the duration of the job, after which it’s destroyed. It also means it’s a lot easier to get up and running without having to worry about physical infrastructure.

If you’re worried about the overhead of spinning up servers for every run, don’t worry: you can still get your pipelines to use a “static” EC2 instance or EMR cluster, as sketched below.
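A sketch of what that looks like: instead of a runsOn reference to an Ec2Resource, you give the activity a workerGroup field and run AWS’s Task Runner agent on your own long-running instance with the same worker group name (“my-static-workers” is just a placeholder).

```python
# Same activity as before, but targeting a long-running box of your own.
# Install and start Task Runner on that instance with a matching worker
# group name; "my-static-workers" is a placeholder value.
static_activity = {
    "id": "SayHello",
    "name": "SayHello",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "echo hello from Data Pipeline"},
        # workerGroup replaces the runsOn reference to a managed resource.
        {"key": "workerGroup", "stringValue": "my-static-workers"},
    ],
}
```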

Ops

These days, all engineers should care about operations. When building an application, you should take key things like monitoring and alerting, as well as infrastructure, into account.

Data Pipeline has an in-built alerting capability that utilises Amazon’s Simple Notification Service (SNS). This can be used to send text messages or emails on failure of a specific step, or of the whole pipeline.
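As an illustration, an SnsAlarm is just another building block; you attach it to an activity via its onFail field (the topic ARN below is a placeholder for a topic you’ve already created in SNS).

```python
# An SnsAlarm block; the topic ARN is a placeholder for your own topic.
failure_alarm = {
    "id": "FailureAlarm",
    "name": "FailureAlarm",
    "fields": [
        {"key": "type", "stringValue": "SnsAlarm"},
        {"key": "topicArn",
         "stringValue": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"},
        {"key": "subject", "stringValue": "Pipeline step failed"},
        # #{node.name} is a Data Pipeline expression, filled in at runtime.
        {"key": "message", "stringValue": "Activity #{node.name} failed."},
    ],
}

# Reference the alarm from any activity to be notified when it fails.
on_fail_field = {"key": "onFail", "refValue": "FailureAlarm"}
```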

The service also has the obvious benefit of being accessible from anywhere with an internet connection. Because the tool is web based, you can dig into failures away from your work computer and hotfix them.

Hugely beneficial if you’re on call to fix important failures, as you can do it from the comfort of your bed via your smartphone :D

Complexity

If you have experience with ETL tools, then using Data Pipeline should be fairly simple. This section won’t talk about the complexity of the tool itself (all tools come with some learning curve) but about the complexity of your use case and how that would fit within Data Pipeline.