David Lyon is an alum from the Insight Data Engineering program in Silicon Valley, now working at Facebook as a Data Engineer. In this post, he tells us why Scala and Spark are such a perfect combination.

In this guide, I will make the case for why Scala’s features make it the ideal language to use for your next distributed computing project.

The most important Scala features:

1. Scala is functional
2. Scala is strongly typed
3. Scala uses the Java Virtual Machine
4. Spark is written in Scala
5. Scala is the highest paying language of 2017

The question many new Insight Fellows ask is:

“Should I learn Scala and Data Engineering together?”

For Fellows with a strong math background, functional programming will make perfect sense, and you'll take to it like a Californian to avocado toast. However, I'd like to warn those of you without a math background who are also new to both distributed computing and functional programming: learning them together will be like learning to ride a unicycle before learning the bicycle.

Scala is Functional

The key distinction between functional and imperative programming is the concept of a pure function.

A pure function’s output only depends on its input parameters, which are immutable. It will always produce exactly the same output given the same input parameters. This output consists entirely of its return value. Any other output is called a “side effect” and renders the function impure.

Example of a Pure Function
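For instance, a minimal sketch:

```scala
// A pure function: its output depends only on its immutable inputs,
// and its only output is the return value.
def add(a: Int, b: Int): Int = a + b

add(2, 3) // 5, every time, no matter where or when it is called
```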

Contrast this with imperative programming, in which procedures mutate their input parameters or otherwise modify variables outside of their scope. Imperative procedures often return None, Null, or a similar empty type.

A procedure with side effects
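A minimal sketch of the impure counterpart:

```scala
// An impure procedure: it mutates state outside its own scope
// and returns Unit (Scala's empty type) instead of a value.
var total = 0

def addToTotal(x: Int): Unit = {
  total += x // side effect: modifies the external variable
}

addToTotal(5)
// `total` is now 5 — the "output" leaked out as a side effect
```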

Why do pure functions matter? There are two types of parallelization: task and data. Many programmers' first introduction to the concept is task parallelization. The most common example is multi-threaded programming on a multi-core CPU, where each core has access to the same global memory. In that case, imperative programming, which may depend on internal or global state, is not so dangerous, because every thread sees the same data on every run and the program produces repeatable results.

Data Engineering, however, is about data parallelization.

A Data Engineer (DE) may need to process data equal to a thousand times the memory capacity of a single CPU. In that case, the data will have to be sharded into a thousand pieces, and each instance of the procedure will see only a small fraction of the total data. Reasoning about the interaction of thousands or millions of procedures, each with their own independent side effects, becomes impossible. The ways that race conditions or lost data can occur due to interacting side effects are too numerous to list.
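A toy sketch of the contrast, using simple in-memory lists as stand-ins for distributed partitions:

```scala
// Impure: every shard mutates shared state. Under real parallelism,
// the unsynchronized `+=` is a race condition waiting to happen.
var runningTotal = 0
def addImpure(x: Int): Unit = runningTotal += x

// Pure: each shard folds its own partition independently,
// and the partial results combine deterministically.
val shards = List(List(1, 2), List(3, 4), List(5))
val grandTotal = shards.map(_.sum).sum // 15, regardless of shard order
```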

Another functional programming advantage of Scala is partial function application.

A traditional function of 5 parameters needs to assemble all 5 parameters at the same place and time before it can operate. However, in Scala, a function of 5 parameters can, for example, accept 3 parameters and return a function of 2 parameters.

A common use case of partial function application is combining continuously arriving streaming user data with aggregated data that arrives once per day, after midnight. The 5-parameter function could be continuously transformed into its 2-parameter version all day by evaluating 3 parameters from streaming user data. The resulting 2-parameter functions could be sent to a new location to await midnight. At midnight, an entire day's worth of 2-parameter functions could finish their calculations by applying the last 2 parameters from the aggregated data.
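A sketch of that pattern, with hypothetical names for the streaming and aggregated parameters:

```scala
// Hypothetical 5-parameter function: 3 streaming inputs, 2 aggregated inputs.
def enrich(userId: String, event: String, ts: Long)(dayAvg: Double, dayMax: Double): String =
  s"$userId $event@$ts avg=$dayAvg max=$dayMax"

// During the day: bind the 3 streaming parameters as each event arrives.
val pending: (Double, Double) => String = enrich("u42", "click", 1507000000L)

// After midnight: apply the 2 aggregated parameters to finish the calculation.
val row = pending(3.5, 9.0)
```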

The special case where parameters are evaluated one at a time is called currying. The simplest case of currying, two parameters being evaluated sequentially, is shown below.

Partial Function Application (Currying example)
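A minimal sketch:

```scala
// A curried function: parameters are evaluated one at a time.
def multiply(x: Int)(y: Int): Int = x * y

// Partially apply the first parameter now...
val double: Int => Int = multiply(2)

// ...and supply the second one later.
double(5) // 10
```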

Lastly, Scala allows lazy evaluation. Loosely speaking, lazy evaluation means passing computations forward unevaluated, and only evaluating them when a value is actually required, such as for I/O.

Let me give an example. You have a billion rows of scanned user data and want to find 10 rows with “Steve” in the first-name field. However, the data quality is dubious: some rows contain names like “SteVe” and “?ste!ve”.

First you need to strip non-alphanumeric characters, then convert the name to lowercase, and finally compare the result with “steve”. Without lazy evaluation, your program will strip non-alphanumerics from 1 billion rows, then convert 1 billion rows to lowercase, only to discover that the first 10 rows were “steve” all along. With lazy evaluation, Scala will stream single rows through the 3-function pipeline until it finds the 10 results it needs, and then halt. Whew, 999,999,990 rows of work saved by being lazy!
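A small sketch of that pipeline using an `Iterator`, which evaluates one element at a time (shrunk from a billion rows to a handful, plus a million filler rows):

```scala
// Normalize a messy name: strip non-alphanumerics, then lowercase.
def clean(name: String): String =
  name.filter(_.isLetterOrDigit).toLowerCase

// A lazy source of rows; Iterator ++ is also lazy, so the filler
// rows are never even generated unless the pipeline asks for them.
val rows = Iterator("SteVe", "?ste!ve", "Alice", "STEVE") ++
  Iterator.continually("NotSteve").take(1000000)

// Each row streams through the whole pipeline; `take(3)` halts
// the iteration as soon as 3 matches are found.
val steves = rows.map(clean).filter(_ == "steve").take(3).toList
// steves == List("steve", "steve", "steve") — the million filler rows were never touched
```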

Scala uses the JVM

As of the fall of 2017, Java is still the most popular programming language. Why abandon Java for Scala, a much newer and far less popular language? Java brings two decades of libraries and the Java Virtual Machine, which lets the same bytecode run on any hardware. Fortunately, you don’t have to give up either! Scala compiles to Java bytecode and interoperates seamlessly with Java libraries. Think of Scala as concise Java, but fully functional.
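For example, classes from Java's standard library can be used from Scala directly, with no wrappers (a minimal sketch):

```scala
// Java standard-library classes, imported and used directly from Scala.
import java.util.UUID
import java.time.LocalDate

val id = UUID.randomUUID()     // a java.util class
val today = LocalDate.now()    // a java.time class
val tag = s"run-$id-$today"    // mixed freely with Scala string interpolation
```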