Pyspark The Right Way

And my journey learning it

What’s wrong with SQL?

I’m a data engineer and I hate writing SQL. Unless I’m missing something (I probably am), the whole process of creating a bunch of .sql and .sh scripts is insane. Coming from a developer background, I found myself asking questions like “why am I copying and pasting all this stuff” and, more importantly, “what the hell does all of this mean” by the time I was done with a somewhat complicated query. Nothing was DRY, and testing any of my work was a royal pain.

What’s so great about Python?

I used Python a little bit in school and then at work to help automate some of the boring stuff I have to do. When done right, it’s incredibly elegant and efficient. You can pick up the basics quickly and get a ton of functionality with very little code. Throw in something like argparse and you have a full command-line application you can share with your team and customers, fully reusable. Neat!

What about Pandas and NumPy?

I knew there were data-centric libraries out there like Pandas and NumPy, but I found them pretty strange to work with…what’s up with these axes and weird indexing magic? By the time I connected to a database and did what I had to, I couldn’t easily follow what I wrote or why it worked.

Okay, so why Pyspark?

My team uses Hadoop a lot, and what’d ya know? We have Spark installed! Even better, there’s this wonderful thing called Pyspark, and its documentation actually looks good. Browsing through examples and programming guides, it just makes sense.

So I started using it whenever I could when I otherwise would have written SQL scripts and realized how much happier I was doing it. I could be creative and take shortcuts I couldn’t before thanks to Python. I could make things reusable. I could easily make command-line interfaces. I could add logging for God’s sake! Obviously, I told everyone and their mothers to start using it.

Warming up…

Not everyone is a Python guru, nor do they have any interest in becoming one. Some folks will happily go about their lives using pyspark like this:

df = spark.sql('SELECT firstname, lastname, dob, startdate, status, title FROM employee ORDER BY dob DESC')

Does it work? Sure. Is it really any different than what you were doing before? Not really. Either way, it lets people ease their way into Spark, which is awesome.

There are all sorts of things you can do differently, though. To start, you can follow PEP 8 (and not cram everything on one line), use functions to make things look a little nicer, and give variables meaningful names instead of the df you see in every example online. Maybe you even realize you’re only specifying all these columns because you’re trying to exclude one.

employee_cols = ['firstname', 'lastname', 'dob',
                 'startdate', 'status', 'title']

employees = (spark.table('employee')
             .select(employee_cols)
             .orderBy('dob', ascending=False))

# or...

employees = (spark.table('employee')
             .drop('ssn')
             .orderBy('dob', ascending=False))

This makes my developer brain happier. We can move what we need into separate variables, follow a fluent interface, and make things easy to understand. If you’re using a decent IDE, you’ll even get some sweet auto-complete and linting.
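If you want to lean into the “use functions” idea a bit more, you can hide the query behind a name that says what it actually returns. Here’s a tiny sketch of that; the function name and default column list are just illustrative:

def recent_employees(spark, cols=('firstname', 'lastname', 'dob')):
    """All employees, newest date of birth first."""
    return (spark.table('employee')
            .select(*cols)
            .orderBy('dob', ascending=False))

recent_employees(spark).show()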

Exploring Re-usability

We can probably do more with this. Let’s make things reusable! One pet peeve I always had with Hive is the simple task of parsing dates from a string. Every time you want to do it, you end up writing something crazy like this (courtesy of StackOverflow, every single time):

SELECT cast(to_date(from_unixtime(unix_timestamp(MYDATE, 'dd-MM-yyyy'))) as date) FROM mytable;

(Pretend Spark 2.2 wasn’t out when you read this part).
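(If you don’t feel like pretending: Spark 2.2 added a to_date that takes a format string directly, so on newer versions the whole dance collapses to one call. A quick sketch, assuming your dates really are 'dd-MM-yyyy' strings like the Hive example above:)

# Spark 2.2+ only
from pyspark.sql.functions import to_date

employee.withColumn('dob', to_date(employee.dob, 'dd-MM-yyyy'))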

Pyspark doesn’t inherently make it easier, but you only need to write it once! Build out your utils.py and never worry about it again.

from pyspark.sql.functions import unix_timestamp


def convert_date(date_col, fmt='yyyy-MM-dd'):
    """Convert a string column to a date type in a sane way."""
    # unix_timestamp gives seconds since the epoch (a long),
    # so cast through timestamp to end up with an actual date
    ts = unix_timestamp(date_col, fmt).cast('timestamp')
    return ts.cast('date')


(employee
 .withColumn('dob', convert_date(employee.dob))
 .show())
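And because the format is just a parameter, the next time a table shows up with dates stored some other way, it’s the same one-liner. (The startdate column and its 'dd-MM-yyyy' format here are hypothetical; assume convert_date lives in utils.py as described above.)

from utils import convert_date

employee.withColumn('startdate', convert_date(employee.startdate, 'dd-MM-yyyy'))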

Even More Re-usability…

I think one of the most powerful things you can do is make your pyspark scripts take command-line arguments. You get to pull out hard-coded references and abstract your code, which can really make life easier…like this neat little script to convert data between different file formats:

import argparse

from pyspark.sql import SparkSession

formats = ['csv', 'json', 'avro', 'parquet', 'orc', 'jdbc']
modes = ['overwrite', 'append', 'ignore', 'error']

parser = argparse.ArgumentParser(description='My Format Converter!')
parser.add_argument('source', help='Source dataset to convert')
parser.add_argument('source_format', choices=formats,
                    help='Format of the source dataset')
parser.add_argument('target', help='Target location')
parser.add_argument('target_format', choices=formats,
                    help='Format of the target dataset')
parser.add_argument('-m', '--mode',
                    default='error', choices=modes,
                    help='Behavior when data already exists')
args = parser.parse_args()

# Scripts run through spark-submit need to create their own session
spark = SparkSession.builder.getOrCreate()

source = (spark.read
          .format(args.source_format)
          .load(args.source))

(source
 .write
 .format(args.target_format)
 .mode(args.mode)
 .save(args.target))
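Save that as a script and anyone on the team can run it without touching the code. Something like this, with a made-up file name and made-up paths:

spark-submit format_converter.py /data/raw/events csv /data/curated/events parquet --mode overwrite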

How do I test this stuff?

I’m not a fan of having to create tables and stuff them full of data just so I can manually test queries against them. Now we can use good old-fashioned Python unit tests, or even just a couple of assert statements. There are plenty of discussions on how to properly unit test your code, too.

import datetime

# Make up some data to test
test_df = spark.createDataFrame([('2018-07-26',)], ['date'])

# Convert the date, get the value back (first row, first column)
mydate = (test_df
          .select(convert_date(test_df.date))
          .first()[0])

# Spark hands date columns back to Python as datetime.date objects
assert mydate is not None
assert isinstance(mydate, datetime.date)
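If you’d rather have a real test file than loose asserts, here’s a minimal pytest-style sketch. It assumes pytest, a throwaway local SparkSession, and that convert_date lives in utils.py; the fixture and file layout are made up for illustration:

import datetime

import pytest
from pyspark.sql import SparkSession

from utils import convert_date


@pytest.fixture(scope='session')
def spark():
    # Small local session just for the tests
    return (SparkSession.builder
            .master('local[1]')
            .getOrCreate())


def test_convert_date(spark):
    test_df = spark.createDataFrame([('2018-07-26',)], ['date'])
    result = (test_df
              .select(convert_date(test_df.date))
              .first()[0])
    assert result == datetime.date(2018, 7, 26)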

In closing…

Python and Spark are both great, but putting them together is awesome. If you take the time to carefully write your code, things become infinitely easier to read through and test, and you end up saving time by holding onto those crazy functions you already wrote. I realize this only scratches the surface of everything you can do with them combined.