Without good documentation, it is impossible to know:

Which columns are required in the input DataFrame?

Which columns are added to the output DataFrame?

What are the types of the input and output columns: String, Double, Int?

If you have a non-trivial program which composes several such transformations, it becomes tricky to follow what is going on.
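To make this concrete, here is a minimal sketch (the column names and helper functions are invented for illustration). Each transformation silently requires and adds columns, yet none of that appears in its DataFrame => DataFrame signature, and the composition compiles whether or not the input actually has those columns:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, lit}

// Requires "first_name" and "last_name"; adds "full_name".
def withFullName(df: DataFrame): DataFrame =
  df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))

// Requires a numeric "amount"; adds "amount_eur".
def withAmountInEur(df: DataFrame, rate: Double): DataFrame =
  df.withColumn("amount_eur", col("amount") * rate)

// Compiles against any DataFrame; a missing column only blows up at runtime.
def enrich(df: DataFrame): DataFrame =
  withAmountInEur(withFullName(df), rate = 1.1)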

Without proper unit testing, your program becomes brittle and breaks with simple changes.

You start to feel as if you were using some kind of dynamically typed language. That can be beneficial in some situations, but then why would you use Scala in the first place?

A good API should let you manipulate data flexibly when you do not know its structure in advance, but it should also let you enforce strong constraints when you do know its structure.

For instance, a program which executes a parametrized SQL statement to write data to a directory would use a 'flexible' API such as the DataFrame.

On the other hand, a program which executes pre-defined statements to compute a sum or an average should use a 'constrained' API.
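Here is a rough sketch of that contrast (the query string, output directory and Transaction type are placeholders, and the typed half anticipates the Dataset API discussed below):

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("flexible-vs-constrained").getOrCreate()
import spark.implicits._

// Flexible: the schema is only known at runtime, so a DataFrame is a natural fit.
def exportQuery(query: String, outputDir: String): Unit =
  spark.sql(query).write.parquet(outputDir)

// Constrained: the structure is known up front, so it can be encoded in a case class
// and the computation checked by the compiler.
case class Transaction(account: String, amount: Double)

def totalAmount(transactions: Dataset[Transaction]): Double =
  transactions.map(_.amount).reduce(_ + _) // assumes a non-empty Dataset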

DataSets to the rescue

These problems have been acknowledged by the Spark development team for quite some time. This is why Spark 1.6 brought us the experimental DataSet API. But does it really bring us the type safety we are looking for? Let's have a look:

Select