SBT, Scala and Spark

Preparing SBT

sbt is an open source build tool for Scala and Java projects, similar to Java’s Maven or Ant. Its main features are: native support for compiling Scala code, and integration with many Scala test frameworks.

Homebrew is required to install sbt on a Mac. It is a macOS package manager, similar to apt-get and yum for Linux. To install Homebrew, we can run the following command:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

To install sbt we run:

brew install sbt

After the command finishes, we can check that it has been installed correctly by running sbt.

Setting the project up

Now that we have sbt, we can build Scala projects.

In our project, we will need a build.sbt file with the build definition. sbt projects should have the following file structure.

src/
  main/
    resources/
      <files to include in main jar here>
    scala/
      <main Scala sources>
    java/
      <main Java sources>
  test/
    resources/
      <files to include in test jar here>
    scala/
      <test Scala sources>
    java/
      <test Java sources>

IntelliJ can create this structure automatically.

Generated files (compiled classes, packaged jars, managed files, caches, and documentation) will be written to the target/ directory by default.

Once we have set our project up, we can use sbt on it. Running sbt starts the interactive mode (a command prompt with tab completion and history!).

Common sbt commands

clean: Deletes all generated files in the target/ directory.

compile: Compiles the main sources (in the src/main/scala and src/main/java directories).

test: Compiles and runs all tests.

console: Starts the Scala interpreter with a classpath including the compiled sources and all dependencies. To return to sbt, type :quit, Ctrl+D (Unix), or Ctrl+Z (Windows).

run <arguments>: Runs the main class for the project in the same virtual machine as sbt.

package: Creates a jar file containing the files in src/main/resources and the classes compiled from src/main/scala and src/main/java.

help <command>: Displays detailed help for the specified command. If no command is provided, displays brief descriptions of all commands.

reload: Reloads the build definition (build.sbt, project/*.scala, and project/*.sbt files). Needed if you change the build definition.

Defining build.sbt

This link shows the basics of sbt definitions.

Adding library dependencies

Let’s look at one of the features we use the most: including external libraries in our project so they can be linked to our sources at compile time.

To depend on third-party libraries, there are two options. The first is to drop jars into lib/ (unmanaged dependencies); the other is to add managed dependencies, which look like this in the build.sbt file:

libraryDependencies += "org.apache.derby" % "derby" % "10.4.1.3"

This is how you add a managed dependency on the Apache Derby library, version 10.4.1.3.

The libraryDependencies key involves two subtleties: += rather than :=, and the % method. += appends to the key’s old value rather than replacing it (explained in the sbt documentation under “more about settings”). The % method is used to construct an Ivy module ID from strings (see “library dependencies”).
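As a sketch, appending several managed dependencies to the same key could look like the fragment below. The second library name and its version are hypothetical, added only to illustrate %%, which appends the Scala binary version to the artifact name:

```scala
// build.sbt (illustrative fragment)
// += appends each dependency to the existing list instead of replacing it.
libraryDependencies += "org.apache.derby" % "derby" % "10.4.1.3"

// %% adds the Scala binary version suffix to the artifact name,
// e.g. "example-lib" resolves as "example-lib_2.10" under Scala 2.10.
// "com.example" %% "example-lib" % "1.0.0" is a hypothetical dependency.
libraryDependencies += "com.example" %% "example-lib" % "1.0.0"
```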

Using a test framework

So far, we have been using the ScalaTest framework. After some research, we settled on this framework because it is written entirely in Scala, offering full compatibility, and its syntax follows Scala conventions rather than the Java style of the other testing frameworks we tried.

In order to use this testing framework (together with ScalaMock for mocking), we added these lines to the build.sbt file:

libraryDependencies += "org.scalamock" %% "scalamock-scalatest-support" % "3.0.1" % "test"

libraryDependencies += "org.scalamock" %% "scalamock-specs2-support" % "3.0.1" % "test"
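These two lines add ScalaMock’s test-framework integration artifacts, which pull in ScalaTest transitively. If we prefer to pin the ScalaTest version explicitly rather than rely on the transitive dependency, we can add a line like the following (the version here is an assumption; pick one compatible with ScalaMock 3.0.1):

```scala
// build.sbt — pin ScalaTest explicitly (version is an assumption)
libraryDependencies += "org.scalatest" %% "scalatest" % "1.9.1" % "test"
```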

Now, we can create a class like this:

class BatchCollectorTest extends FlatSpec with ShouldMatchers {

  "Collector" should "not collect from empty source" in {
    val source = Seq[Int]()
    val batches = new Collector(2).Partition(source)
    batches.toIndexedSeq.length should be (0)
  }

  "Collector" should "return one batch" in {
    val source = Seq[Int](1, 2)
    val batches = new Collector(2).Partition(source)
    batches.toIndexedSeq.length should be (1)
  }

  "Collector" should "return 1 batch" in {
    val source = Seq[Int](1)
    val batches = new Collector(2).Partition(source)
    batches.toIndexedSeq.length should be (1)
  }

}

After this, if we run sbt test, sbt should run the tests in this class.
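The Collector class under test is not shown in this document. A minimal sketch consistent with these tests could look as follows; the name Partition and the constructor argument come from the tests, but the implementation itself is an assumption:

```scala
// Hypothetical implementation consistent with the tests above:
// groups a source sequence into batches of at most `batchSize` elements.
class Collector(batchSize: Int) {
  def Partition[T](source: Seq[T]): Iterator[Seq[T]] =
    source.grouped(batchSize) // an empty input yields an empty iterator
}
```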

Working with Spark

Let’s now start using the Spark libraries in order to create a Spark application in Scala. First, we need to include the required library in our build.sbt file like this:

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.3.1" % "provided"

Note that this library has been marked as "provided", since it will already be present in the Spark engine at the moment we submit our application. Once the library has been included, we can run:

clean

compile

in the sbt prompt so that sbt downloads the required dependencies and lets us access their contents from our source code. Once we have an app ready, we can run:

compile

test

package

This will run our tests and, if everything goes well, package our code into a .jar file inside the target/ directory.
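As a sketch, a minimal Spark 1.x application packaged this way could look like the following; the object name and the computation are illustrative, not the project’s actual code:

```scala
// Minimal illustrative Spark 1.x application.
// spark-core is "provided": we only compile against it here,
// and the Spark engine supplies the jars at runtime via spark-submit.
import org.apache.spark.{SparkConf, SparkContext}

object BatchApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchApp")
    val sc = new SparkContext(conf)
    try {
      // Count the even numbers in a small sample dataset.
      val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
      println(s"Even numbers: $evens")
    } finally {
      sc.stop()
    }
  }
}
```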

Linking external libraries

Sometimes, we need to link libraries that are not in a Maven repository, so they cannot be referenced using libraryDependencies. The Teradata JDBC Driver is a clear example of this: Teradata offers a .jar library that can be downloaded directly from their website. In order to reference such a library from our source code, we can follow two different approaches. One is to modify our build.sbt file to tell sbt where the library is located so it is linked at compilation. The second approach is to copy the .jar library into the lib/ directory; sbt looks in this folder by default when compiling, and IntelliJ also looks at it in order to offer us code completion and other IDE features.
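For the build.sbt approach, the fragment below shows one way to point sbt at a jar that lives outside lib/; the file path here is hypothetical:

```scala
// build.sbt — reference a downloaded jar from a custom location (path is hypothetical)
unmanagedJars in Compile += file("/opt/drivers/terajdbc4.jar")
```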

Running our spark application

Running a Spark application requires a working Spark engine first. We can get Spark from the Spark website or install it using Ambari on a Hadoop cluster.

Applications are submitted with the spark-submit command, but it is important to note the --deploy-mode option and the --jars option. If we need to submit additional libraries, such as the one used for Teradata, we must specify the locations of the libraries using --jars. Multiple libraries can be sent to the Spark engine, but their paths have to be separated by commas, without blank spaces.

Monitoring our application execution

Spark applications running in cluster mode can be monitored using YARN, the resource manager in the Hadoop cluster. Applications running in standalone mode can be monitored via the Spark history server on our localhost; the URL can be found in the terminal output once we have run the spark-submit command.

Spark Shell

In standalone mode, we can run the spark-shell command to start the Spark engine. Once it has started, we have a full Spark session waiting for our instructions, which can be typed directly into the terminal. The Spark context can be used as sc and the SQL context is available as sqlContext.
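For example, once the shell is up, we can use the pre-defined sc directly at the prompt; this is a toy computation, just to show that the context is live:

```scala
// Typed at the spark-shell prompt; sc is created for us by the shell.
val rdd = sc.parallelize(1 to 100)
rdd.filter(_ % 2 == 0).count() // 50 even numbers in 1..100
```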