Why I'm Making a New ETL Framework

Given the variety of approaches already available for running extract-transform-load (ETL) processes, why am I making a new one, DataDuck ETL? After exploring the other options, I decided there was a need for a code-based framework for moving data into an Amazon Redshift data warehouse.

When I first started consulting with companies to set up data warehouses, I noticed a few things. First, they really wanted to use Amazon Redshift, even when other technologies were considered.

Second, the companies I've worked with wanted to avoid Hadoop, seeing it as too complicated for their needs. When I would suggest Hadoop, they would ask about the alternatives and request something written in a dynamic programming language (such as Ruby or Python). While these companies have gigabytes of data, they don't have so much that they truly need Hadoop, and a simpler alternative actually works better for them.

Third, they wanted a code-based framework so they could keep full control over privacy and security. A code-based framework can also be customized, monitored, tested, and tracked in version control. While there are codeless SaaS services that import databases into Amazon Redshift through a web-based UI, it turns out many companies actually prefer a code-based framework. On the other hand, plenty of companies prefer not to manage any code or servers at all, and those companies can use such services instead of DataDuck ETL.

The objective follows from those three observations: make a code-based ETL framework that makes it straightforward to move data into Amazon Redshift.

DataDuck ETL should feel familiar to users of Ruby on Rails: both are written in Ruby, both provide `console` and `dbconsole` commands, and both declare validations with a similar syntax.
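To illustrate what a Rails-style validation syntax looks like, here's a minimal, self-contained sketch of that kind of declarative class macro in plain Ruby. This is a hypothetical illustration, not DataDuck ETL's actual validation API; the `validates` macro and `errors_for` method are invented for this example.

```ruby
# Hypothetical sketch of a Rails-style validation macro.
# NOT DataDuck ETL's actual API -- just an illustration of the
# declarative syntax the comparison to Rails refers to.
class Table
  # Class macro: record a validation on the table class itself.
  def self.validates(column, message)
    @validations ||= []
    @validations << [column, message]
  end

  def self.validations
    @validations || []
  end

  # Check one row (a hash) against the declared validations.
  def errors_for(row)
    self.class.validations.filter_map do |column, message|
      "#{column} #{message}" if row[column].nil?
    end
  end
end

class Decks < Table
  validates :id, "must be present"
  validates :user_id, "must be present"
end

Decks.new.errors_for(id: 1, user_id: nil)
# => ["user_id must be present"]
```

The appeal of this style is that the table class itself documents its data-quality rules, right next to the rest of the table's configuration.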

Here's an example. Running `dataduck etl decks` is all that's needed to ETL the following table:

```ruby
class Decks < DataDuck::Table
  source :my_database, :decks, ["id", "name", "user_id", "cards",
      "num_wins", "num_losses", "created_at", "updated_at",
      "is_drafted", "num_draft_wins", "num_draft_losses"]

  transforms :calculate_num_totals

  def calculate_num_totals(row)
    row[:num_total] = row[:num_wins] + row[:num_losses]
    row[:num_draft_total] = row[:num_draft_wins] + row[:num_draft_losses]
    row
  end

  output({
    :id => :integer,
    :name => :string,
    :user_id => :integer,
    :num_wins => :integer,
    :num_losses => :integer,
    :num_total => :integer,
    :num_draft_total => :integer,
    :created_at => :datetime,
    :updated_at => :datetime,
    :is_drafted => :boolean,
    # Note that num_draft_wins and num_draft_losses
    # are not included in the output, but are used in
    # the transformation.
  })
end
```

The project is just getting started, but it is already being used by real startups to move their data to Amazon Redshift. If you or your company would be interested in learning more, get in touch!