Spark On: Let's Code! (Part 1)

Big Data has become one of the hottest topics and challenges for many companies as dealing with large quantities of data in real-time is often a necessity these days. Spark is one of the most popular frameworks used to resolve problems arising during the processing of Big Data.

At 47 Degrees, we started using Spark to provide real-time streaming data for some of our projects. There’s an abundance of benefits to using Spark, which is why we decided to launch a series of articles on our experiences and how you can benefit from this framework. First, we need a problem to work on; in this case, we’ll use the Twitter Streaming API because it’s an effectively infinite source of data. Then we’ll discuss how to code our solution. Later in the series, we’ll dive into distributing, deploying, and monitoring these kinds of applications.

First Approach

Although there are several ways to achieve the same result, we will use this proposed solution where the main points are as follows:

An HTTP API to start, stop, and query the streaming.

Micro services to deal with the Twitter Streaming.

Cassandra Persistence layer where we’re going to store a few tables to complete the example.

As a note, we will discuss this on a deeper level when we cover the infrastructure in a later article. For now, let’s focus on the code.

Spark, Participants, and some Concepts

For those of you who aren’t familiar with Spark, we’ll explain the different components within the Spark Cluster.

Spark Driver

The Spark Driver handles the configuration of the cluster, preparing the context and declaring the operations over the Resilient Distributed Datasets (RDDs). In addition, it’s responsible for creating tasks and submitting them to the workers for execution. Essentially, it acts as the coordinator for the different job stages.

val conf = new SparkConf().setAppName("Our self-contained App")
val sc = new SparkContext(conf)

The snippet above is an example of how a Spark Context is initialized.

Worker Nodes

Worker nodes (also referred to as slaves) allow us to scale out our Spark cluster: if we need more capacity, we increase the number of available worker nodes.

Executors

A worker node can run one or more executors, each of which is a process running within that worker node.

val data = Array(1, 2, 3, 4, 5)
val distributedData = sc.parallelize(data)

The above example shows how to create a parallelized collection holding the numbers one to five. Once created, distributedData can be operated on in parallel.

Tasks

Each executor runs tasks using the CPU, memory and disks over the worker nodes. The executors have assigned resources to run these tasks. Hence, a task can be considered as a unit of work that will be sent to one executor.

Continuing with the distributedData example, Spark will run one task for each partition of the distributed dataset. For instance, for the reduce operation, Spark will break the computation into tasks to run on separate machines. Each machine runs both its part of the map and a local reduction, returning only its answer to the driver program (Spark Driver).

distributedData.reduce((a, b) => a + b)
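To make the "local reduction, then combine on the driver" idea concrete, here is a minimal sketch in plain Scala, with no Spark involved. The object PartitionedReduce and its two functions are hypothetical names invented for this illustration; they only mimic what Spark does with partitions and tasks, and assume a non-empty input.

```scala
object PartitionedReduce {
  // Split the data into at most `numPartitions` slices, loosely mimicking
  // how a parallelized collection is divided into partitions.
  def partition(data: Seq[Int], numPartitions: Int): Seq[Seq[Int]] = {
    val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(size).toSeq
  }

  // Each "task" reduces its own partition locally; the "driver" then
  // combines the partial results. Assumes `data` is non-empty.
  def distributedSum(data: Seq[Int], numPartitions: Int): Int = {
    val partials = partition(data, numPartitions).map(_.reduce(_ + _))
    partials.reduce(_ + _)
  }
}
```

With the numbers one to five and two partitions, the partial sums are 6 and 9, and PartitionedReduce.distributedSum(Seq(1, 2, 3, 4, 5), 2) returns 15, the same value the reduce call above produces.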

A quick look at Spark Streaming

As you can see in the Spark docs, Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Taking this into account, the “glue” components to interconnect those participants in our model would be the following:

Receivers: we (developers) need to develop these receivers as long-running tasks in order to receive data from the relevant data source.

DStream: input data received from streaming sources. Over these DStreams, we can apply transformations, actions, and output operations, depending on our requirements.

We went through these concepts relatively quickly, so we encourage you to look at the Cluster Overview in the Spark docs for further details.

Who’s who in our Model

Let’s briefly sum up the correspondences in our example:

The Spark Driver (in practical terms, our SparkContext) will be placed on the HTTP API side.

The Driver launches executors on the Spark cluster (worker nodes). This cluster could be managed in several ways: YARN, Mesos, or a Spark Standalone Cluster. In this first approach, we are just running locally.

Tasks are sent by the Driver to the executors for processing the data on the worker nodes.

Regarding the Spark Streaming:

TwitterReceiver is our proposed custom receiver in this example. As we’ll see later, this class will run on worker nodes to receive external data from Twitter.

TwitterInputDStream inputs tweets received from the Twitter streaming source.

Let’s take a quick look at how this works:

After booting up the application, the Spark driver runs receivers as long-running tasks.

Once the streaming has started, receivers divide the stream into blocks.

These blocks are replicated on other executors. Blocks are often considered as pieces of data, given a batch interval or window size. For each block, the Driver launches tasks to process them.

Applying some transformations to these blocks generates the output results. In our case, we’re going to save them in a persistence layer, more precisely, in a Cassandra NoSQL database.
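As a rough mental model of how a continuous stream turns into discrete batches, the following plain Scala sketch groups timestamped records by batch interval. It is a deliberate simplification and not how Spark implements block generation or replication; the names BatchSketch, Event, and toBatches are hypothetical and exist only for this illustration.

```scala
object BatchSketch {
  // A record as it might arrive from a receiver: arrival time plus payload.
  final case class Event(timestampMs: Long, payload: String)

  // Assign every event to the batch whose time range contains it.
  // Batch n covers [n * batchIntervalMs, (n + 1) * batchIntervalMs).
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .map { case (batch, es) => batch -> es.map(_.payload) }
}
```

With a one-second interval, events arriving at 0 ms and 400 ms land in batch 0, an event at 1000 ms lands in batch 1, and so on. Each batch is then what the Driver schedules tasks for.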

Booting the App

Now let’s dig into the code. The following lines serve as the entry point of our application, where we can see the component dependencies:

object Boot extends App with ApiHttpService with BootHelper {
  override implicit val system = ActorSystem()
  override implicit val executor = system.dispatcher
  override implicit val materializer = ActorMaterializer()(system)
  override implicit val ssc: StreamingContext = createStreamingContext
  override implicit val cassandraConnector: CassandraConnector = CassandraConnector(sparkConf)
  override implicit val twitterAuth: TwitterAuth = createTwitterAuth
  override implicit val twitterStreaming: TwitterInputDStream = TwitterStreamingServices.createTwitterStream()
  Http().bindAndHandle(routes, interface, port)
}

This works to satisfy our API HTTP Services:

trait ApiHttpService extends Protocols {
  implicit val system: ActorSystem
  implicit def executor: ExecutionContextExecutor
  implicit val materializer: Materializer
  implicit val ssc: StreamingContext
  implicit val cassandraConnector: CassandraConnector
  implicit val twitterAuth: TwitterAuth
  implicit val twitterStreaming: TwitterInputDStream
  val routes = { ... }
}

In short, we are saying our application needs to have:

ActorSystem, ExecutionContextExecutor, and Materializer as a part of the needed set of implicits to allow Akka HTTP to work properly.

StreamingContext: the Spark Streaming Context is responsible for ingesting the information into our system.

CassandraConnector will be used to store the data in our Cassandra cluster.

TwitterAuth stores the Twitter credentials from the app environment. Obviously, we cannot connect to the Twitter Streaming API without these settings.

TwitterInputDStream: Twitter stream.

In addition, we have provided the following helper trait, where we configure the Spark Context and create the Streaming Context:

trait BootHelper {
  val sparkConf = new SparkConf()
    .setMaster(sparkMaster)
    .setAppName(sparkAppName)
    .setSparkHome(sparkHome)
    .setJars(sparkOnJars)
    .set("spark.executor.memory", sparkExecutorMemory.toString)
    .set("spark.cores.max", sparkCoresMax.toString)
    .set("spark.cassandra.connection.host", cassandraHosts)
    .set("spark.akka.heartbeat.interval", sparkAkkaHeartbeatInterval.toString)
    .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
  def createStreamingContext: StreamingContext =
    new StreamingContext(conf = sparkConf, batchDuration = Seconds(streamingBatchInterval))
  def createTwitterAuth: TwitterAuth =
    TwitterAuth(
      consumerKey = consumerKey,
      consumerSecret = consumerSecret,
      accessToken = accessToken,
      accessTokenSecret = accessTokenSecret)
}

SparkConf: Configuration for a Spark application. Used to set various Spark parameters as key-value pairs. Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.

SparkConf.setMaster: The master URL to connect to, such as local to run locally with one thread, local[4] to run locally with 4 cores, or spark://master:7077 to run on a Spark standalone cluster.

SparkConf.setAppName: Set a name for your application.

SparkConf.setSparkHome: Set the location where Spark is installed on worker nodes.

SparkConf.setJars: Set JAR files to distribute to the cluster.

spark.executor.memory: Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

spark.cores.max: When running on a standalone deploy cluster or a Mesos cluster in coarse-grained sharing mode, the maximum amount of CPU cores to request for the application from across the cluster (not from each machine). If not set, the default will be spark.deploy.defaultCores on Spark's standalone cluster manager, or infinite (all available cores) on Mesos.

spark.cassandra.connection.host: Contact point to connect to the Cassandra cluster.

spark.akka.heartbeat.interval: This is set to a larger value to disable the failure detector that comes built into Akka.

spark.broadcast.factory: Which broadcast implementation to use.

Notice that the createStreamingContext function will create a StreamingContext by providing the configuration necessary for a new SparkContext. In this case, it’s based on the previous configuration (SparkConf) and the predefined batch duration, which is directly related to the time interval at which streaming data will be divided into batches.

The REST API is quite simple and has just three endpoints:

GET /twitter-streaming: to query the streaming status.

POST /twitter-streaming: to start the streaming.

DELETE /twitter-streaming: to stop the streaming context.

You can take a look at the entire implementation in the repo code: com.fortysevendeg.sparkon.api.http.ApiHttpService.

Streaming Custom Receiver

Given our entry point, let’s talk about our custom Twitter receiver. As you can see in the following piece of code, we are defining a new class TwitterInputDStream that extends ReceiverInputDStream. This provides an interface to start an InputDStream on the worker nodes, so all of the data is received on those nodes instead of on the Driver node. This is an important difference from a plain InputDStream, which is executed directly on the same node as the driver program. That could be a major issue when receiving huge quantities of information, as the Spark Driver node would be overloaded and the system would eventually fail.

class TwitterInputDStream(
    ssc: StreamingContext,
    twitterAuth: TwitterAuth,
    filters: Seq[String],
    storageLevel: StorageLevel
) extends ReceiverInputDStream[Status](ssc) {
  private[this] val authorization = new OAuthAuthorization(new ConfigurationBuilder()
    .setOAuthConsumerKey(twitterAuth.consumerKey)
    .setOAuthConsumerSecret(twitterAuth.consumerSecret)
    .setOAuthAccessToken(twitterAuth.accessToken)
    .setOAuthAccessTokenSecret(twitterAuth.accessTokenSecret)
    .build())
  override def getReceiver(): Receiver[Status] =
    new TwitterReceiver(
      twitterAuth = authorization,
      filters = filters,
      storageLevel = storageLevel)
}

ReceiverInputDStream.getReceiver: Gets the receiver object that will be sent to the worker nodes to receive data. This method needs to be defined by any specific implementation of a ReceiverInputDStream.

In the code above, we overrode the required method getReceiver from the ReceiverInputDStream class, providing a new TwitterReceiver instance. As an additional note, we’re passing an important parameter called storageLevel, which, as the name suggests, defines the RDD storage level used to persist or cache the datasets across operations. There are several possibilities, including on disk, in memory, serialized or not, etc. You can find the whole reference here: RDD Persistence.

Below is the custom receiver class definition:

class TwitterReceiver(twitterAuth: Authorization, filters: Seq[String], storageLevel: StorageLevel)
    extends Receiver[Status](storageLevel) with Logging {
  protected lazy val receiverActor = {
    val twitterStreamingServices = new TwitterStreamingServices {}
    getEnv
      .actorSystem
      .actorOf(Props(new SparkTwitterFSMActor(twitterStreamingServices)), "SparkTwitterFSMActor")
  }
  def onStart() {
    receiverActor
    receiverActor ! StartStreaming(receiver = this, twitterAuth = twitterAuth, filters = filters)
  }
  def onStop() {
    receiverActor ! StopStreaming
    receiverActor ! PoisonPill
  }
}

Receiver.onStart: This method is called by the system when the receiver is started. This function must initialize all resources (threads, buffers, etc.) necessary for receiving data. This function must be non-blocking, so receiving the data must occur on a different thread. Received data can be stored with Spark by calling store(data).

Receiver.onStop: This method is called by the system when the receiver is stopped. All resources (threads, buffers, etc.) setup in onStart() must be cleaned up in this method.

A custom receiver is defined by implementing the methods onStart() and onStop().

onStart() should define the setup steps necessary to start receiving data,

and onStop() should define the cleanup steps necessary to stop receiving data.

In this example, we’re delegating the responsibility for starting and stopping the streaming to an FSM Actor. If we examine the docs further, we can see that FSM Actors are based on Finite State Machines (FSM), which the Erlang documentation describes as follows:

An FSM can be described as a set of relations of the form: State(S) x Event(E) -> Actions (A), State(S')

It’s true that we could just use a single class with a var status or even a custom ActorStream Receiver, but it was fun to do it the FSM Actor way. As you can see, we’ve added nothing related to actor supervision or cluster failover strategies, to keep the example simple.

class SparkTwitterFSMActor(twitterStreamingServices: TwitterStreamingServices)
    extends FSM[State, Data] {
  startWith(Stopped, NoStreamingData)
  when(Stopped) {
    case Event(StopStreaming, _) =>
      stay using NoStreamingData
    case Event(ss: StartStreaming, _) =>
      val newTwitterStream: TwitterStream =
        twitterStreamingServices.getTwitterStream(ss.twitterAuth, ss.receiver)
      ss.filters match {
        case Nil => newTwitterStream.sample()
        case _ =>
          val query = new FilterQuery
          query.track(ss.filters.toArray)
          newTwitterStream.filter(query)
      }
      goto(Streaming) using StreamingData(stream = newTwitterStream)
    case _ =>
      logger.warn("Case not expected in 'Stopped' State")
      stay()
  }
  when(Streaming) {
    case Event(ss: StartStreaming, sd: StreamingData) =>
      stay using sd
    case Event(StopStreaming, sd: StreamingData) =>
      sd.stream.shutdown()
      goto(Stopped) using NoStreamingData
    case _ =>
      logger.warn("Case not expected in 'Streaming' State")
      stay()
  }
}
class TwitterStatusListener(receiver: TwitterReceiver) extends StatusListener {
  def onStatus(status: Status): Unit = receiver.store(status)
  def onDeletionNotice(statusDeletionNotice: StatusDeletionNotice) {}
  def onTrackLimitationNotice(i: Int) {}
  def onScrubGeo(l: Long, l1: Long) {}
  def onStallWarning(stallWarning: StallWarning) {}
  def onException(e: Exception) {
    receiver.restart("Unexpected error receiving tweets", e)
  }
}

Initial State: The Actor initial state will be Stopped, where the streaming data will be empty (NoStreamingData).

When: when(state) inserts a new StateFunction at the end of the processing chain for the given state. If the stateTimeout parameter is set, entering this state without a differing explicit timeout setting will trigger a StateTimeout event; the same is true when using stay.

Change State: goto(state) produces transition to other state. Return this from a state function in order to effect the transition. Note all these methods are provided by Akka FSM DSL (Domain Specific Language).

stay: Produce empty transition descriptor. Return this from a state function when no state change is to be effected.

I know the code is self-explanatory, but let’s briefly summarize it anyway:

When the state is Stopped and the receiver starts, the Actor moves to the running state (Streaming), so the system starts ingesting tweets from the Streaming API, filtering them according to the application configuration.

When the state is Streaming (running) and a stop message is received, the Actor shuts down the Twitter stream and moves to the Stopped state.
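The two transitions summarized above fit the State(S) x Event(E) -> Actions (A), State(S') relation directly. As a minimal sketch, assuming we strip away Akka and the Twitter client entirely, the same machine can be written as a pure function. The names StreamingFsm, Start, Stop, and transition are hypothetical and are not part of the project code; the action is reduced to a descriptive label.

```scala
object StreamingFsm {
  sealed trait State
  case object Stopped extends State
  case object Streaming extends State

  sealed trait Event
  case object Start extends Event
  case object Stop extends Event

  // State x Event -> (Action, State'). The action is only a label here;
  // in the actor it would open or shut down the Twitter stream.
  def transition(state: State, event: Event): (String, State) =
    (state, event) match {
      case (Stopped, Start)   => ("open stream", Streaming)
      case (Stopped, Stop)    => ("nothing to do", Stopped)
      case (Streaming, Start) => ("already streaming", Streaming)
      case (Streaming, Stop)  => ("shut down stream", Stopped)
    }
}
```

Feeding the events Start, Start, Stop through transition, starting from Stopped, ends in Stopped again, with the second Start leaving the state untouched, just as the stay using sd branch does in the actor.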

Data Processing

Here is where Spark offers its real power: it can process huge quantities of data quickly, in near real time. In our simple example, we’re going to apply a couple of transformations to the batches of data coming in from the stream.

def ingestTweets(topics: Set[String], windowSize: Duration)
    (implicit ssc: StreamingContext, dsStream: DStream[Status]) = {
  // dsStream -> streaming_tweets_by_day
  tweetsByDay(dsStream)
  // dsStream -> streaming_tweets_by_track
  tweetsByTrack(dsStream = dsStream, topics = topics, windowSize = windowSize)
  ssc.checkpoint(sparkCheckpoint)
  ssc.start()
}

streaming_tweets_by_day: Cassandra table where we will store all tweets, partitioned by day.

streaming_tweets_by_track: Cassandra table where we will store only the tracks we are interested in (configured as a part of our application).

The dsStream, as we pointed out earlier, is no more than a continuous sequence of RDDs (all of the same type, Status) representing a continuous stream of data from the Twitter Streaming API.

In short, the following two functions are quite similar.

The tweetsByDay function will store all the tweets from the Twitter Streaming API in the streaming_tweets_by_day Cassandra table.

On the other hand, the tweetsByTrack function will store the number of occurrences of the track words we are filtering for in another table, streaming_tweets_by_track.

Both functions have been implemented as follows:

def tweetsByDay(dsStream: DStream[Status]) {
  dsStream
    .map(toTweetsByDay)
    .saveToCassandra(
      sparkCassandraKeyspace,
      "streaming_tweets_by_day",
      SomeColumns(
        "id", "user_id", "user_name", "user_screen_name",
        "created_timestamp", "created_day", "tweet_text", "lang",
        "retweet_count", "favorite_count", "latitude", "longitude"))
}
def tweetsByTrack(dsStream: DStream[Status], topics: Set[String], windowSize: Duration) {
  dsStream
    .flatMap(_.getText.toLowerCase.split("""\s+"""))
    .filter(topics.contains)
    .countByValueAndWindow(windowSize, windowSize)
    .transform { (rdd, time) =>
      val dateParts = formatTime(time, dateFormat).split(dateFormatSplitter) map (_.toInt)
      rdd map { case (track, count) => toTweetsByTrack(dateParts, track, count) }
    }
    .saveToCassandra(
      sparkCassandraKeyspace,
      "streaming_tweets_by_track",
      SomeColumns("track", "year", "month", "day", "hour", "minute", "count"))
}

DStream.map: Return a new DStream by applying a function to all elements of this DStream.

CassandraDStream.saveToCassandra: Performs (com.datastax.spark.connector.writer.WritableToCassandra) for each produced RDD. Uses specific column names with an additional batch size.

DStream.flatMap: Return a new DStream by applying a function to all elements of this DStream, and then flattening the results.

DStream.filter: Return a new DStream containing only the elements that satisfy a predicate.

DStream.countByValueAndWindow: Return a new DStream in which each RDD contains the count of distinct elements in RDDs in a sliding window over this DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. The first parameter, windowDuration, corresponds to the width of the window, whereas the second parameter, slideDuration, is the sliding interval of the window (i.e., the interval after which the new DStream will generate RDDs).

DStream.transform: Return a new DStream in which each RDD is generated by applying a function on each RDD of 'this' DStream.
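To see what the flatMap, filter, and count steps of tweetsByTrack compute for a single window, here is a minimal sketch on ordinary Scala collections instead of a DStream. It assumes the tweets of one window are already available as plain strings; TrackCount and countTracks are hypothetical names used only for this illustration, and nothing is written to Cassandra.

```scala
object TrackCount {
  // Lower-case each tweet, split it on whitespace, keep only the tracked
  // topics, and count how many times each one appears.
  def countTracks(tweets: Seq[String], topics: Set[String]): Map[String, Long] =
    tweets
      .flatMap(_.toLowerCase.split("""\s+"""))
      .filter(topics.contains)
      .groupBy(identity)
      .map { case (track, hits) => track -> hits.size.toLong }
}
```

For the tweets "Scala and Spark", "spark streaming", and "Akka" with the topics spark and scala, the result maps scala to 1 and spark to 2. Because the window and slide durations are the same in tweetsByTrack, consecutive windows do not overlap, so each tweet contributes to exactly one such count.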

Checkpointing your app

In our streaming application, within the spark context, two parts might fail:

Spark Workers (executors). In this case, the tasks and the receivers are restarted by Spark automatically (by Spark Driver).

Spark Driver, how do we recover? Recovering with Checkpointing is possible. We can save the DStreams to a fault-tolerant storage like HDFS or S3. This operation would be applied periodically. Therefore, in the case of a failed driver, we could restart it from this checkpointing storage.

This section could be an article in itself, so in the meantime you can read more about it in the spark docs reference here.

Conclusion

That’s it for now! So far, we’ve looked at how the Spark framework helps with big data processing in a very simple way. We’ve just reviewed how to ingest information from one of the most famous streaming data sources, and without a doubt, this approach will keep our code simple in similar scenarios.

This is a simple example, but throughout this series, we’re going to see how this architecture becomes a real solution. We will also talk about the new features in the Spark 1.5.0 release, like Backpressure. Finally, we will analyze the ingested data and deploy the project in a real cluster and cover topics such as data recovery and system monitoring. Stay tuned!

You can check out the entire code here.

Further References

Refer to the Spark Streaming Programming Guide for more information.