I thought it would be a nice fun first little project to try using Groovy to run a Spark job on Google Cloud Dataproc! Dataproc manages Hadoop & Spark for you: it’s a service that provides managed Apache Hadoop, Apache Spark, Apache Pig and Apache Hive. You can process big datasets at low cost, and keep those costs under control by quickly creating managed clusters of any size and turning them off when you’re done. In addition, you can of course use all the other Google Cloud Platform services and products from Dataproc (e.g. storing the big datasets in Google Cloud Storage, on HDFS, or through BigQuery, etc.)

More concretely, how do you run a Groovy job in Google Cloud Dataproc’s managed Spark service? Let’s see that in action!

To get started, I checked out Paolo’s samples from GitHub, and I even Groovy-fied the Pi calculation example (based on this approach) to make it a bit more idiomatic. The job estimates Pi with a Monte Carlo method: it throws random points at a square and counts how many land inside the inscribed circle:

package org.apache.spark.examples

import groovy.transform.CompileStatic
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.api.java.function.Function
import org.apache.spark.api.java.function.Function2

@CompileStatic
final class GroovySparkPi {
    static void main(String[] args) throws Exception {
        def sparkConf = new SparkConf().setAppName("GroovySparkPi")
        def jsc = new JavaSparkContext(sparkConf)

        int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2
        int n = 100000 * slices

        // distribute the n sample indices across the cluster
        def dataSet = jsc.parallelize(0..<n, slices)

        // pick a random point in the 2x2 square,
        // and return 1 if it falls inside the unit circle
        def mapper = {
            double x = Math.random() * 2 - 1
            double y = Math.random() * 2 - 1
            return (x * x + y * y < 1) ? 1 : 0
        }

        int count = dataSet
            .map(mapper as Function)
            .reduce({ int a, int b -> a + b } as Function2)

        println "Pi is roughly ${4.0 * count / n}"

        jsc.stop()
    }
}
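
Notice the "as Function" and "as Function2" coercions: Groovy lets you turn a closure into an implementation of an interface with the "as" keyword, which is how plain Groovy closures can stand in for the function types of Spark’s Java API. Here’s a tiny standalone illustration of the trick (my own example, not part of Paolo’s samples):

// Groovy coerces a closure into an implementation of an interface via "as"
interface Transformer {
    int apply(int x)
}

def doubler = { int x -> x * 2 } as Transformer
assert doubler.apply(21) == 42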

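If you’d like to sanity-check the Monte Carlo logic without spinning up a cluster, the same computation boils down to a few lines of plain Groovy (a quick local sketch of mine, no Spark involved): the mapper picks random points in a 2x2 square, and the proportion landing inside the unit circle approaches Pi/4.

// Local, single-machine version of the same Pi estimate (no Spark)
int n = 1000000
def inside = (0..<n).count {
    double x = Math.random() * 2 - 1
    double y = Math.random() * 2 - 1
    x * x + y * y < 1   // true when the point falls inside the unit circle
}
println "Pi is roughly ${4.0 * inside / n}"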

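To actually submit this class to Dataproc, you’ll want to compile it and package it together with the Groovy runtime, since the cluster’s Spark installation doesn’t ship Groovy. Below is a minimal Gradle sketch of what such a fat-jar build could look like; the plugin and dependency versions are my own assumptions, so check Paolo’s samples on GitHub for the actual setup:

// build.gradle -- a minimal fat-jar build (assumed versions, for illustration only)
buildscript {
    repositories { jcenter() }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.3'
    }
}

apply plugin: 'groovy'
apply plugin: 'com.github.johnrengelman.shadow'

repositories { mavenCentral() }

dependencies {
    // the Groovy runtime must ship inside the jar, as the cluster doesn't provide it
    compile 'org.codehaus.groovy:groovy-all:2.4.7'
    // Spark is already on the cluster's classpath, so there's no need to bundle it
    compileOnly 'org.apache.spark:spark-core_2.10:1.6.2'
}

Then "gradle shadowJar" produces a single jar you can submit as a Spark job on your Dataproc cluster.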