Comparing Python and SQL for Building Data Pipelines - Breaking into the workforce as a web developer, my first interaction with databases and SQL was using an Object Relational Model (ORM). I was using the Django query sets API and had an excellent experience using the interface...

Breaking into the workforce as a web developer, my first interaction with databases and SQL was using an Object Relational Model (ORM). I was using the Django query sets API and had an excellent experience using the interface...

Thereon-after, I changed to a data engineering role and became much more involved in leveraging datasets to build AI. It became my responsibility to take the data from the user application and turn it into something usable by Data Scientists, a process commonly known as ETL.

The data on our production system was very messy and required a lot of transformations before anyone was going to be able to build AI on top of it. There were JSON columns that had different schemas per row, columns contained mixed data types and some rows had erroneous values (people saying that they were born before 1850 or in the future). As I set out on cleaning, aggregating and engineering features for the data, I tried to decide which language would be best for the task. Having used python all day every day before this, I knew that it could do the job. However, what I learned through this experience was that just because python could do the job doesn’t mean it should.

The first time I misjudged SQL is when I assumed that SQL couldn’t do complicated transformations

We are working with a time-series dataset where we wanted to track particular users over time. Privacy laws prevent us from knowing the specific dates of the user visits, so we decided that we would normalize the date of the record to the users first visit (ie 5 days after their first visit etc.). For our analysis, it was important to know the time since the last visit as well as the time since their first visit. A had two sample datasets, one with approximately 7.5 million rows measuring 6.5 GBs, and the other with 550 000 rows measuring 900 MB.

Using the python and SQL code seen below, I used the smaller dataset to first test the transformations. Python and SQL completed the task in 591 and 40.9 seconds respectively. This means that SQL was able to provide a speed-up of roughly 14.5X!

# PYTHON # connect to db using wrapper around psycopg2 db = DatabaseConnection(db='db', user='username', password='password') # grab data from db and load into memory df = db.run_query("SELECT * FROM cleaned_table;") df = pd.DataFrame(df, columns=['user_id', 'series_id', 'timestamp']) # calculate time since first visit df = df.sort_values(['user_id', 'timestamp'], ascending=True).assign(time_since_first=df.groupby('user_id').timestamp.apply(lambda x: x - x.min())) # calculate time since last visit df = df.assign(time_since_last=df.sort_values(['timestamp'], ascending=True).groupby('user_id')['timestamp'].transform(pd.Series.diff)) # save df to compressed csv df.to_csv('transform_time_test.gz', compression='gzip') -- SQL equivalent -- increase the working memory (be careful with this) set work_mem='600MB'; -- create a dual index on the partition CREATE INDEX IF NOT EXISTS user_time_index ON table(user_id, timestamp); -- calculate time since last visit and time since first visit in one pass SELECT *, AGE(timestamp, LAG(timestamp, 1, timestamp) OVER w) AS time_since_last, AGE(timestamp, FIRST_VALUE(timestamp) OVER w) AS time_since_first FROM table WINDOW w AS (PARTITION BY user_id ORDER BY timestamp);

This SQL transformation was not only faster but the code is also more readable and thus easier to maintain. Here, I used the lag and first_value functions to find specific records in the users history (called a partition). I then used the age function to determine the time difference between visits.

What’s even more interesting is that when these transformation scripts were applied to the 6.5 GB dataset, python completely failed. Out of 3 attempts, python crashed 2 times and my computer completely froze the 3rd time… while SQL took 226 seconds.

More info:

https://www.postgresql.org/docs/9.5/functions-window.html

http://www.postgresqltutorial.com/postgresql-window-function/

The second time I misjudged SQL is when I thought that it couldn’t flatten irregular json

Another game changer for me was realizing that Postgres worked with JSON quite well. I initially thought that it would be impossible to flatten or parse json in postgres…I can’t believe that I was so dumb. If you want to relationize json and its schema is consistent between rows, then your best bet is probably to use Postgres built in ability to parse json.

-- SQL (the -> syntax is how you parse json) SELECT user_json->'info'->>'name' as user_name FROM user_table;

On the other hand, half the json in my sample dataset isn’t valid json and thus is stored as text. In which case I was left with a choice, I could either recode the data to make it valid OR I could just drop the rows that didn’t follow the rules. To do this, I created a new SQL function called is_json that I could then use to qualify valid json in a WHERE clause.

-- SQL create or replace function is_json(text) returns boolean language plpgsql immutable as $ begin perform $1::json; return true; exception when invalid_text_representation then return false; end $; SELECT user_json->'info'->>'name' as user_name FROM user_table WHERE is_json(user_json);

Unfortunately, I found that the user_json had a different schema depending on what app version the user was on. Although this makes sense from an application development point of view, it makes it really expensive to conditionally parse every possibility per row. Was I destined to enter python again… not a chance! I found another function on stack-overflow written by a postgres god named klin.

-- SQL create or replace function create_jsonb_flat_view (table_name text, regular_columns text, json_column text) returns text language plpgsql as $ declare cols text; begin execute format ($ex$ select string_agg(format('%2$s->>%%1$L "%%1$s"', key), ', ') from ( select distinct key from %1$s, jsonb_each(%2$s) order by 1 ) s; $ex$, table_name, json_column) into cols; execute format($ex$ drop view if exists %1$s_view; create view %1$s_view as select %2$s, %3$s from %1$s $ex$, table_name, regular_columns, cols); return cols; end $;

This function was able to successfully flatten my json and solve my worst nightmare quite easily.

Final Comments

There is an idiom that declares Python as the second best language to do almost anything. I believe this to be true and in some instances have found the performance difference between Python and the ‘best’ language to be negligible. In this case however, python was unable to compete with SQL. These realizations along with readings I’ve been doing has completely changed my approach to ETL. I now work under the paradigm of “Do not move data to code, move code to your data”. Python moves your data to the code while SQL acts on it in place. What’s more is that I know that I’ve only scratched the surface of sql and postgres abilities. I’m looking forward to more awesome functionality, and the possibility of getting speed ups from using an analytical warehouse.

python sql database