Python UDFs in Amazon Redshift by Elise | September 26, 2016

Shoutout to Wagon + an Exciting Announcement

This post originally appeared on the Wagon blog, the team behind the SQL editor that many of us know, love, and use on a daily basis.

As you may have seen, Wagon recently announced that they will be joining the Box family and shutting down the current Wagon product on October 3, 2016.

For the SQL folks out there worrying about where they will get their SQL on (myself included!), we are excited to share that we will be adding SQL support to our data science environment, Rodeo!

So, get pumped for that. And in the meantime, here's an overview of Amazon Redshift and an introduction to Python UDFs for all you Python + SQL lovers!

In the Beginning...

In the early 2000s, the behemoth collection of cloud computing services we now know as Amazon Web Services (AWS) was little more than sparks firing in Chris Pinkham and Benjamin Black’s neurons. In 2003, the two presented a paper (blog post here) outlining a radical vision for a retail computing infrastructure that would be standardized, automated and built upon web services. The next year, Simple Queue Service, the first AWS service for public use, was launched.

Fast forward almost a decade, and AWS is now the most commonly used cloud platform among enterprise software developers. AWS products span the gamut of web services, from computation (e.g. EC2) to networking (e.g. VPC) and storage (e.g. S3). In this post we’ll explore a small fraction of a fraction of the AWS ecosystem–a database that’s generating all kinds of groundswell right now: Amazon Redshift.

The rest of this post will talk about Redshift at a high level and then dive into a mini overview of User Defined Functions (UDFs), how they work, why they’re great, and how to start using them.

Amazon Redshift

Amazon Redshift is a hosted data warehouse that’s accessible, easy to set up, built for speed, and suitable for a wide variety of combining, storing, and compute-heavy analytics tasks.

Two things make Redshift particularly attractive. First, Redshift can handle insane amounts of data–it is a petabyte-scale warehouse. A petabyte is a lot (10^15 bytes) of data. As a point of reference, the entire master catalog of Netflix video in 2013 amounted to about 3.14 petabytes of storage space (interesting read on Quora). Second, unlike Amazon’s other hosted database product, Amazon RDS, Redshift stores data in a column-oriented structure. Column orientation is good for tables containing columns with lots of repeated values (e.g. credit card names, county/state, product type), like CRM data. The benefit of column-oriented data is that because values within a column are uniform, there are opportunities for storage optimization via compression. You can read more about how to maximize compression here.
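If you're curious what compression would buy you on one of your own tables, Redshift can sample it and recommend an encoding per column. A quick sketch (the table name `events` here is just a placeholder for one of your own tables):

```sql
-- Sample the table and suggest a compression encoding for each column.
-- `events` is a hypothetical table name; substitute your own.
ANALYZE COMPRESSION events;
```

The output lists each column alongside the recommended encoding and the estimated reduction in size.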

Redshift handles large-scale column-oriented datasets using massively parallel processing (MPP), performing coordinated computations across a large number of processors in parallel, making it a fast and powerful data warehouse option.

Data Warehouse Setup in the Data Age

Even just a few years ago, getting a data warehouse and proper ETL processes in place was a long, painful and probably very expensive ordeal. But we’ve arrived in the data age where easy-to-use, affordable data solutions are bountiful.

At Yhat, we use Redshift to warehouse everything–CRM data (we use SFDC), product data, site metrics from Google Analytics, and data from a bunch of other sources. It took us about 20 mins to set up the database on AWS, and it took us…wait for it…another 20 mins or so to set up all of our ETL using Fivetran, which we couldn’t be more impressed with.

SQL IDE Done Right

Most SQL IDEs of yesteryear leave something to be desired in terms of UX. The majority are clunky and have super old school frankenstein UIs. Why they all focus on exploring the DB schema rather than on making it easy to write queries, view results and think critically about your data has always been a mystery.

Well those days are also over. Wagon was the query-focused SQL app I’d been looking for for years. Wagon boasted a clean UX designed for analysts, with features carefully chosen with a keen eye toward usability for people writing tens or hundreds of queries per day. Wagon got it in spades.

Wagon, we will miss you immensely, and wish your team the best on their new adventure at Box. We hope that y'all love the SQL editor we're building into Rodeo as much as we loved Wagon!

P.S. Your t-shirts are still the best and softest around. Okay, now back to UDFs.

Overview of Python UDFs in Redshift

UDF stands for user-defined function, meaning that you can add functions to an environment (in this case, Redshift) in addition to those that come built in. Python UDFs allow you to combine the power of Redshift with what you know and love about the Python programming language, without switching between IDEs or systems.

The great thing about UDFs in Redshift is that Redshift will automatically execute them using its MPP architecture. One caveat to keep in mind is that your Python code still won’t execute as quickly as native SQL functions (AVG, MIN, MAX, etc.) that are baked into the database.

How to Use UDFs

You can certainly work with text in pure SQL, but some tasks are just easier to do in a scripting language like Python instead. Here’s a toy example to illustrate how to use Python functionality within Redshift using a UDF.

Suppose a column in one of our tables contains huge chunks of text or html, and we’re interested in finding any email addresses within any one record. Let’s write a function that will take in raw text and return a pipe | separated string containing any email addresses found within the input text document.

Define the function like so:

CREATE OR REPLACE FUNCTION f_find_all_emails( txt VARCHAR(20000) )
RETURNS VARCHAR(20000)
IMMUTABLE
AS $$
    """Extract all email addresses found within a given string `txt`.

    Return: A pipe-delimited string composed of any email addresses found.
        (e.g. 'john@example.com|jeane@example.com|sara@example.com')

    Example:
        mydb=# SELECT f_find_all_emails(email_body) AS emails_found
               FROM customer_support_emails
               LIMIT 1;
    """
    import re

    RG_EMAIL = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

    if not txt:
        return None
    try:
        emails_found = RG_EMAIL.findall(txt)
        return "|".join([email[0] for email in emails_found])
    except Exception:
        return None
$$ LANGUAGE plpythonu;

Once defined, you can use it like this:

SELECT f_find_all_emails(email_body) AS emails_found
FROM emails
LIMIT 1;

-- results will look like this
-- b.hunt@hunt.uk.co|john@vitaeodio.ca|steven@nonsapien.net.ca|charlie@vitaeodio.ca
-- (1 row)

This is a scalar function, so it’ll return one record for each input row (i.e. not an aggregate function). One thing to remember is that your UDFs are per-database, meaning that if you have multiple in your Redshift cluster, you’ll need to define your functions in each database.
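Since the body of a Python UDF is ordinary Python, one handy pattern is to sanity-check the logic locally before shipping it to Redshift. Here's a minimal sketch of the same email extraction as a plain Python function (the regex is copied from the UDF above; the function name is just a local stand-in):

```python
import re

# Same pattern used inside the UDF body.
RG_EMAIL = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

def find_all_emails(txt):
    """Return a pipe-delimited string of email addresses found in txt."""
    if not txt:
        return None
    matches = RG_EMAIL.findall(txt)
    # findall returns tuples (full match, last group); keep the full match.
    return "|".join(m[0] for m in matches)

print(find_all_emails("Contact john@example.com or sara@example.org"))
# john@example.com|sara@example.org
```

Once the local version behaves the way you want, you can paste the body between the `$$` markers of the CREATE FUNCTION statement more or less verbatim.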

Example

Redshift Python UDFs are based on Python 2.7 and come preloaded with a lot of our favorite libraries, including NumPy, SciPy and Pandas. You can also import custom modules from S3 and the web.
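Importing a custom module works via CREATE LIBRARY, which pulls a packaged Python module from S3 (or a URL) and makes it importable inside UDF bodies. A rough sketch, where the library name, bucket path, and credentials are all placeholders:

```sql
-- Load a custom Python module (packaged as a .zip) from S3.
-- `my_module`, the S3 path, and the credentials below are hypothetical.
CREATE OR REPLACE LIBRARY my_module
LANGUAGE plpythonu
FROM 's3://my-bucket/python-libs/my_module.zip'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>';
```

After that, `import my_module` works inside any Python UDF in that database.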

Here’s the template published on the AWS blog that you can use to start creating your own scalar functions:

CREATE [ OR REPLACE ] FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
    python_program
$$ LANGUAGE plpythonu;
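To make the template concrete, here's a minimal scalar UDF filled in (a hypothetical example of ours, not from the AWS post) that null-safely trims and lowercases a string:

```sql
CREATE OR REPLACE FUNCTION f_lower_trim ( s VARCHAR(256) )
RETURNS VARCHAR(256)
IMMUTABLE
AS $$
    # Redshift passes SQL NULL in as Python None.
    if s is None:
        return None
    return s.strip().lower()
$$ LANGUAGE plpythonu;

-- SELECT f_lower_trim('  Hello World  ');  -- returns 'hello world'
```

IMMUTABLE is the right volatility here because the function always returns the same output for the same input, which lets Redshift cache results.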

The scalar UDFs that you create will return a single result value for each input value. Once you’ve defined a UDF, you can use it in any SQL statement.

Helpful Resources

To learn more about Python UDFs in Redshift, check out Amazon’s documentation, which is super helpful and covers everything from constraints to security and Python support. You can also check out this blog post from AWS that uses UDFs to analyze the CMS Open Payments data set.