In this post, let’s look at the Spark Scala DataFrame API, and specifically at how you can leverage the Dataset[T].transform function to write composable code.

Note: a DataFrame is a type alias for Dataset[Row].

The example

There are some transactions coming in for a certain amount, containing a “details” column describing the payer and the beneficiary:

Note that this DataFrame could be a Dataset[Transaction], but the extra typing adds nothing to these examples.
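The sample data in the original post was shown as an image; here is a minimal sketch of such a DataFrame. The column names, dates, and the "payer->beneficiary" format inside details are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("transform-example")
  .getOrCreate()
import spark.implicits._

// Each transaction has a date, an amount, and a "details" string
// of the (assumed) form "payer->beneficiary".
val df = Seq(
  ("2019-01-01", 10.0, "Alice->Bob"),
  ("2019-01-01", 20.0, "Carol->Bob"),
  ("2019-01-02", 50.0, "Alice->Dave"),
  ("2019-01-02", 5.0,  "Dave->Carol")
).toDF("date", "amount", "details")
```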

Without .transform

Let’s create 2 functions to process the transactions:

sumAmounts: to sum the amount value aggregated over one or more column(s)

extractPayerBeneficiary: to separate the payer and beneficiary from a column into 2 new columns
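A straightforward way to write these is as methods that take the DataFrame as a parameter. The exact signatures, the output column names, and the "->" separator in details are assumptions on my part:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, regexp_extract, sum}

// Sums the "amount" column, grouped by the given columns.
def sumAmounts(df: DataFrame, by: Column*): DataFrame =
  df.groupBy(by: _*).agg(sum(col("amount")).as("amount"))

// Splits a "payer->beneficiary" column into two new columns.
def extractPayerBeneficiary(df: DataFrame, column: String): DataFrame =
  df.withColumn(s"${column}_payer", regexp_extract(col(column), "(.*)->", 1))
    .withColumn(s"${column}_beneficiary", regexp_extract(col(column), "->(.*)", 1))
```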

Use these methods to answer the following question: “What beneficiaries on which days had a total amount value greater than 25?”
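Without .transform, the calls have to be nested and read inside-out: extraction first, then aggregation, then the filter. A sketch, using the method signatures assumed above:

```scala
import org.apache.spark.sql.functions.col

// Which beneficiaries, on which days, had a total amount greater than 25?
val result = sumAmounts(
  extractPayerBeneficiary(df, "details"),
  col("date"), col("details_beneficiary")
).filter(col("amount") > 25)
```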

This is a simple example, but it does not read well. Compare it to the typical use of Dataset functions:

df.select(...).filter(...).withColumn(...)...

We factored out some logic into methods and it helped us reason about each piece separately, but the readability of the code got arguably worse.

Using .transform

The transform function is a method of the Dataset class and its purpose is to add a “concise syntax for chaining custom transformations.”

def transform[U](t: Dataset[T] => Dataset[U]): Dataset[U] = t(this)

It takes a function from Dataset[T] (T being the type of the rows in your Dataset) to Dataset[U] (U being the type of the rows in the resulting Dataset). U can be the same as T.

A function DataFrame => DataFrame fits that signature: if we unpack the type alias we get Dataset[Row] => Dataset[Row], where both T and U are Row.

Using the methods you defined earlier, simply switching to .transform is a good starting point:
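With the methods unchanged, each call can be turned into a DataFrame => DataFrame function by leaving an underscore where the DataFrame argument goes. A sketch, based on the signatures assumed earlier:

```scala
import org.apache.spark.sql.functions.col

// The underscore stands in for the DataFrame that .transform will supply.
df.transform(extractPayerBeneficiary(_, "details"))
  .transform(sumAmounts(_, col("date"), col("details_beneficiary")))
  .filter(col("amount") > 25)
```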

Going a step further

The sumAmounts and extractPayerBeneficiary methods don’t fit .transform very well. Because they return a DataFrame rather than a function DataFrame => DataFrame, you constantly need an underscore in place of the DataFrame argument to turn each call into a function that .transform can accept.

You can rewrite those methods to return functions with the signature DataFrame => DataFrame, matching the .transform parameter type exactly:
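Only the parameter list changes; each body is now wrapped in a returned function. Same assumed column logic as before:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, regexp_extract, sum}

// Returns a function that sums "amount" grouped by the given columns.
def sumAmounts(by: Column*): DataFrame => DataFrame =
  df => df.groupBy(by: _*).agg(sum(col("amount")).as("amount"))

// Returns a function that splits "payer->beneficiary" into two columns.
def extractPayerBeneficiary(column: String): DataFrame => DataFrame =
  df =>
    df.withColumn(s"${column}_payer", regexp_extract(col(column), "(.*)->", 1))
      .withColumn(s"${column}_beneficiary", regexp_extract(col(column), "->(.*)", 1))
```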

Only the signature had to be changed and a “df =>” added!

Now that the underscores are gone, you can combine the functions in different ways:
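The chain now reads top to bottom, and because the transformations are plain functions they can also be composed ahead of time with andThen. A sketch, reusing the function-returning versions:

```scala
import org.apache.spark.sql.functions.col

// Chained directly:
df.transform(extractPayerBeneficiary("details"))
  .transform(sumAmounts(col("date"), col("details_beneficiary")))
  .filter(col("amount") > 25)

// Or composed into a reusable pipeline first, then applied once:
val dailyTotals = extractPayerBeneficiary("details")
  .andThen(sumAmounts(col("date"), col("details_beneficiary")))
df.transform(dailyTotals)
```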

All of your custom transformations now return DataFrame => DataFrame, so you can use a type alias to better describe the returned value:

type Transform = DataFrame => DataFrame

e.g. def sumAmounts(by: Column*): Transform
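Put together, the alias makes the intent of each signature explicit. A sketch, where the helper bodies are the assumed versions from above:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, regexp_extract, sum}

// A "transformation" is any function from one DataFrame to another.
type Transform = DataFrame => DataFrame

def sumAmounts(by: Column*): Transform =
  df => df.groupBy(by: _*).agg(sum(col("amount")).as("amount"))

def extractPayerBeneficiary(column: String): Transform =
  df =>
    df.withColumn(s"${column}_payer", regexp_extract(col(column), "(.*)->", 1))
      .withColumn(s"${column}_beneficiary", regexp_extract(col(column), "->(.*)", 1))
```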

Summary

Custom transformation methods can be rearranged to return a function of type DataFrame => DataFrame.

Returning functions makes it easier to compose transformations and use them with .transform.

A type alias can be used to explicitly define a “transformation”.

You can find my build.sbt and the code above in this gist.