This blog post will show how to chain Spark SQL functions so you can avoid messy nested function calls that are hard to read. Spark doesn’t provide a clean way to chain SQL function calls, so you will have to monkey patch the org.apache.spark.sql.Column class and define these methods yourself or leverage the spark-daria project.

Illustrating the problem

Suppose you’d like to trim the whitespace of a column and lowercase all of the letters. Here’s the code you’ll write.

val wordsDf = Seq(
  ("Batman "),
  (" CATWOMAN"),
  (" pikachu ")
).toDF("word")

val actualDf = wordsDf.withColumn(
  "cleaned_word",
  trim(lower(col("word")))
)

The code is ugly with two nested function calls and it will only get worse as more methods are added.

The Dataset class defines a transform method to allow for method chaining instead of nested function calls. I talked about the transform method in depth in a previous blog post.
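For reference, here's a minimal sketch of that pattern. The `withGreeting` and `withFarewell` helpers are hypothetical names used only for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical custom transformations, named only for illustration
def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))

def withFarewell(df: DataFrame): DataFrame =
  df.withColumn("farewell", lit("goodbye"))

// transform chains DataFrame-level functions instead of nesting them:
// withFarewell(withGreeting(wordsDf)) becomes
wordsDf.transform(withGreeting).transform(withFarewell)
```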

The Column class doesn’t provide a clean work-around, so we will define a method that can chain SQL function calls using the Dataset#transform method as our inspiration.

Chaining SQL function calls

Let’s monkey patch the Column class and add a chain method that will enable us to run code like col("word").chain(lower).chain(trim).

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

object ColumnExt {

  implicit class ColumnMethods(c: Column) {

    def chain(t: (Column => Column)): Column = {
      t(c)
    }

  }

}

An implicit class is used to monkey patch the Column class. The chain method takes a function as an argument and returns a Column object.
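To see what chain is doing in isolation, here's a plain-Scala analogue of the same pattern, with String standing in for Column (a sketch for illustration only):

```scala
// A plain-Scala analogue of chain: apply any String => String
// function to the wrapped value and return the result.
implicit class StringMethods(s: String) {
  def chain(t: String => String): String = t(s)
}

val cleaned = " Batman ".chain(_.trim).chain(_.toLowerCase)
// cleaned is "batman"
```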

Here’s the code in action:

import ColumnExt._

val wordsDf = Seq(
  ("Batman "),
  (" CATWOMAN"),
  (" pikachu ")
).toDF("word")

val actualDf = wordsDf.withColumn(
  "cleaned_word",
  col("word").chain(lower).chain(trim)
)



actualDf.show()

+----------+------------+
|      word|cleaned_word|
+----------+------------+
|   Batman |      batman|
|  CATWOMAN|    catwoman|
|  pikachu |     pikachu|
+----------+------------+

The chain method only works out of the box for SQL functions that take exactly one Column argument. We’ll need to make some tweaks for SQL functions that take multiple arguments.

SQL Functions that take arguments

You can use Scala’s magical underscore operator to chain SQL functions that take arguments.

Take a look at the rpad method signature.

def rpad(str: Column, len: Int, pad: String): Column

Let’s see how to use rpad with the chain method.

val wordsDf = Seq(
  ("hi "),
  (" ok")
).toDF("word")

val actualDf = wordsDf.withColumn(
  "diff_word",
  col("word").chain(trim).chain(rpad(_, 5, "x"))
)



actualDf.show()

+----+---------+
|word|diff_word|
+----+---------+
| hi |    hixxx|
|  ok|    okxxx|
+----+---------+

This is one of the many uses of Scala’s magical underscore operator.
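If the underscore syntax is unfamiliar: rpad(_, 5, "x") expands to the anonymous function (c: Column) => rpad(c, 5, "x"), which has the Column => Column type that chain expects. The same sugar works with ordinary Scala functions (a plain-Scala sketch for illustration):

```scala
// _ + 5 expands to the anonymous function (n: Int) => n + 5
val addFive: Int => Int = _ + 5

// _.padTo(5, 'x') is a plain-Scala analogue of rpad(_, 5, "x")
val padRight: String => String = _.padTo(5, 'x')
```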

Chaining User Defined Functions

Let’s add another method to the Column class that will make it easy to chain user defined functions (UDFs).

User defined functions have a different method signature than the built-in SQL functions, so we need to monkey patch the Column class again.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

object ColumnExt {

  implicit class ColumnMethods(c: Column) {

    def chainUDF(udfName: String, cols: Column*): Column = {
      callUDF(udfName, c +: cols: _*)
    }

  }

}

The chainUDF method takes a String and a variable number of Column arguments and returns a Column. The Scala syntax is a little crazy to account for the variable number of Column arguments (h/t internet stranger for helping me write this code).
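Two operators are at work here: c +: cols prepends the first column to the varargs sequence, and : _* splats that sequence back into a varargs call. A plain-Scala sketch of the same shape (the join helper is hypothetical):

```scala
// join accepts a variable number of String arguments
def join(parts: String*): String = parts.mkString("-")

val first = "a"
val rest = Seq("b", "c")

// first +: rest builds Seq("a", "b", "c"); : _* splats it into join,
// the same shape as callUDF(udfName, c +: cols: _*)
val result = join(first +: rest: _*)
```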

Let’s see chainUDF in action.

def appendZ(s: String): String = {
  s"${s}Z"
}

spark.udf.register("appendZUdf", appendZ _)

def prependA(s: String): String = {
  s"A${s}"
}

spark.udf.register("prependAUdf", prependA _)

val hobbiesDf = Seq(
  ("dance"),
  ("sing")
).toDF("word")

val actualDf = hobbiesDf.withColumn(
  "fun",
  col("word").chainUDF("appendZUdf").chainUDF("prependAUdf")
)



actualDf.show()

+-----+-------+
| word|    fun|
+-----+-------+
|dance|AdanceZ|
| sing| AsingZ|
+-----+-------+

If you don’t use the chainUDF method, you’ll need to write some ugly code like this.

val actualDf = hobbiesDf.withColumn(
  "fun",
  callUDF("prependAUdf", callUDF("appendZUdf", col("word")))
)

The following code snippet demonstrates how callUDF can be used with user defined functions that take multiple arguments.

def appendZ(s: String): String = {
  s"${s}Z"
}

spark.udf.register("appendZUdf", appendZ _)

def appendWord(s: String, word: String): String = {
  s"${s}${word}"
}

spark.udf.register("appendWordUdf", appendWord _)

val hobbiesDf = Seq(
  ("dance"),
  ("sing")
).toDF("word")

val actualDf = hobbiesDf.withColumn(
  "fun",
  col("word").chainUDF("appendZUdf").chainUDF("appendWordUdf", lit("cool"))
)



actualDf.show()

+-----+----------+
| word|       fun|
+-----+----------+
|dance|danceZcool|
| sing| singZcool|
+-----+----------+

It gets even better! The chain and chainUDF methods both return Column objects, so they play nicely with each other.

Using chain and chainUDF together

The chain and chainUDF methods can be used together to write some beautiful code.

def appendZ(s: String): String = {
  s"${s}Z"
}

spark.udf.register("appendZUdf", appendZ _)

val wordsDf = Seq(
  ("Batman "),
  (" CATWOMAN"),
  (" pikachu ")
).toDF("word")

val actualDf = wordsDf.withColumn(
  "cleaned_word",
  col("word").chain(lower).chain(trim).chainUDF("appendZUdf")
)



actualDf.show()

+----------+------------+
|      word|cleaned_word|
+----------+------------+
|   Batman |     batmanZ|
|  CATWOMAN|   catwomanZ|
|  pikachu |    pikachuZ|
+----------+------------+

Nobody wants to look at code like this:

val actualDf = wordsDf.withColumn(
  "cleaned_word",
  callUDF("appendZUdf", trim(lower(col("word"))))
)

Next Steps

The chain and chainUDF methods are defined and tested in the spark-daria library. Please submit pull requests if you can make these methods more robust!

I am working on the spark-spec project to document the behavior of the Spark engine and that’s the only reason I had enough familiarity with Spark to write the code in the blog post. Let me know if you’d like to contribute to spark-spec or join our Slack channel (my email is here).