More Kafka and Spark, please!

Hello, world!

Having joined Rittman Mead more than 6 years ago, the time has come for my first blog post. Let me start by standing on the shoulders of blogging giants, revisiting Robin's old blog post Getting Started with Spark Streaming, Python, and Kafka.

The blog post was very popular, touching on the subjects of Big Data and Data Streaming. To put my own twist on it, I decided to:

not use Twitter as my data source, because there surely must be other interesting data sources out there,

use Scala, my favourite programming language, to see how different the experience is from using Python.

Why Scala?

Scala is admittedly more challenging to master than Python. However, because Scala compiles into Java bytecode, it can be used pretty much anywhere where Java is being used. And Java is being used everywhere. Python is arguably even more widely used than Java, however it remains a dynamically typed scripting language that is easy to write in but can be hard to debug.

Is there a case for using Scala instead of Python for the job? Both Spark and Kafka were written in Scala (and Java), hence they should get on like a house on fire, I thought. Well, we are about to find out.

My data source: OpenWeatherMap

When it comes to finding sample data sources for data analysis, the selection out there is amazing. At the time of this writing, Kaggle offers 14,470 freely available datasets, many of them in easy-to-digest formats like CSV and JSON. However, when it comes to real-time sample data streams, the selection is quite limited. Twitter is usually the go-to choice - easily accessible and well documented. Too bad I decided not to use Twitter as my source.

Another alternative is the Wikipedia Recent changes stream. Although in the stream schema there are a few values that would be interesting to analyse, overall this stream is more boring than it sounds - the text changes themselves are not included.

Fortunately, I came across the OpenWeatherMap real-time weather data website. It has a free API tier, limited to 1 request per second - quite enough for tracking changes in weather. Its different API schemas return plenty of numeric and textual data, all interesting for analysis. The APIs work in a very standard way: first you apply for an API key, then you can query the API with a simple HTTP GET request. (Apply for your own API key instead of using the sample one - it is easy.)

This request

https://samples.openweathermap.org/data/2.5/weather?q=London,uk&appid=b6907d289e10d714a6e88b30761fae22

gives the following result:

{
  "coord": {"lon": -0.13, "lat": 51.51},
  "weather": [
    {"id": 300, "main": "Drizzle", "description": "light intensity drizzle", "icon": "09d"}
  ],
  "base": "stations",
  "main": {"temp": 280.32, "pressure": 1012, "humidity": 81, "temp_min": 279.15, "temp_max": 281.15},
  "visibility": 10000,
  "wind": {"speed": 4.1, "deg": 80},
  "clouds": {"all": 90},
  "dt": 1485789600,
  "sys": {"type": 1, "id": 5091, "message": 0.0103, "country": "GB", "sunrise": 1485762037, "sunset": 1485794875},
  "id": 2643743,
  "name": "London",
  "cod": 200
}

Getting data into Kafka - considering the options

There are several options for getting your data into a Kafka topic. If the data will be produced by your application, you should use the Kafka Producer Java API. You can also develop Kafka Producers in .Net (usually C#), C, C++, Python, and Go. The Java API can be used by any programming language that compiles to Java bytecode, including Scala. Moreover, there are Scala wrappers for the Java API: skafka by Evolution Gaming and Scala Kafka Client by cakesolutions.

OpenWeatherMap is not my application, and what I need is integration between its API and Kafka. I could cheat and implement a program that would consume OpenWeatherMap's records and produce records for Kafka. The right way of doing that, however, is by using Kafka Source connectors, for which there is an API: the Connect API. Unlike the Producers, which can be written in many programming languages, for the Connectors I could only find a Java API. I could not find any nice Scala wrappers for it. On the upside, Confluent's Connector Developer Guide is excellent, rich in detail though not quite a step-by-step cookbook.

However, before we decide to develop our own Kafka connector, we must check for existing connectors. The first place to go is Confluent Hub. There are quite a few connectors there, complete with installation instructions, ranging from connectors for particular environments like Salesforce, SAP, IRC, Twitter to ones integrating with databases like MS SQL, Cassandra. There is also a connector for HDFS and a generic JDBC connector. Is there one for HTTP integration? Looks like we are in luck: there is one! However, this connector turns out to be a Sink connector.

Ah, yes, I should have mentioned - there are two flavours of Kafka Connectors: the Kafka-inbound are called Source Connectors and the Kafka-outbound are Sink Connectors. And the HTTP connector in Confluent Hub is Sink only.

Googling for Kafka HTTP Source Connectors gives few interesting results. The best I could find was Pegerto's Kafka Connect HTTP Source Connector. Contrary to what the repository name suggests, the implementation is quite domain-specific: it extracts stock prices from particular websites and has very little error handling. Searching Scaladex for 'Kafka connector' does yield quite a few results, but nothing for HTTP. However, there I found Agoda's nice and simple Source JDBC connector (though for a very old version of Kafka), written in Scala. (Do not use this connector for JDBC sources; instead use the one by Confluent.) I can use this as an example to implement my own.

Creating a custom Kafka Source Connector

The best place to start when implementing your own Source Connector is the Confluent Connector Development Guide. The guide uses JDBC as an example. Our source is an HTTP API, so early on we must establish whether our data source is partitioned, whether we need to manage offsets for it, and what the schema is going to look like.

Partitions

Is our data source partitioned? A partition is a division of source records that usually depends on the source medium. For example, if we are reading our data from CSV files, we can consider the different CSV files to be a natural partition of our source data. Another example of partitioning could be database tables. But in both cases the best partitioning approach depends on the data being gathered and its usage. In our case, there is only one API URL and we are only ever requesting current data. If we were to query weather data for different cities, that would be a very good partitioning - by city. Partitioning would allow us to parallelise the Connector data gathering - each partition would be processed by a separate task. To make my life easier, I am going to have only one partition.
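If we did partition by city, the Connector would divide the cities across its tasks. A minimal self-contained sketch of that idea (the object and function names here are illustrative, not part of the connector code):

```scala
// Hypothetical sketch: distributing source partitions (cities) across
// at most maxTasks task configurations, the way taskConfigs would.
object PartitioningSketch {
  def splitCities(cities: List[String], maxTasks: Int): List[List[String]] = {
    val groups = math.min(maxTasks, cities.size)
    if (groups == 0) Nil
    else cities.grouped(math.ceil(cities.size.toDouble / groups).toInt).toList
  }

  def main(args: Array[String]): Unit =
    // 3 cities over 2 tasks -> two groups of work
    println(splitCities(List("London", "Riga", "Brighton"), 2))
}
```

Each resulting group would become one task's configuration, letting the tasks poll their cities in parallel.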

Offsets

Offsets are for keeping track of the records already read and the records yet to be read. An example of that is reading data from a file that is continuously being appended: there can be rows already inserted into a Kafka topic, and we do not want to process them again, to avoid duplication. Why would that be a problem? Surely, when going through a source file row by row, we know which row we are looking at: anything above the current row is processed, anything below is new records. Unfortunately, most of the time it is not as simple as that. First of all, Kafka supports concurrency, meaning there can be more than one Task busy processing Source records. Another consideration is resilience: if a Kafka Task process fails, another process will be started up to continue the job. This can be an important consideration when developing a Kafka Source Connector.
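For comparison, a file-based connector would attach a source partition and a source offset to every record it emits. Kafka Connect represents both as simple key/value maps; a self-contained sketch (names illustrative):

```scala
// Illustrative only: the shapes Kafka Connect expects for source
// partitions and offsets are plain String-keyed maps.
object OffsetSketch {
  // identifies *where* the record came from, e.g. which file
  def sourcePartition(fileName: String): Map[String, String] =
    Map("filename" -> fileName)

  // identifies *how far* we have read within that partition
  def sourceOffset(line: Long): Map[String, Long] =
    Map("position" -> line)

  def main(args: Array[String]): Unit = {
    println(sourcePartition("weather.csv"))
    println(sourceOffset(42L))
  }
}
```

On restart, a Task would ask Kafka Connect for the last committed offset of its partition and resume from there.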

Is it relevant for our HTTP API connector? We are only ever requesting current weather data. If our process fails, we may miss some time periods, but we cannot recover them later on. Offset management is not required for our simple connector.

So that is Partitions and Offsets dealt with. Can we make our lives just a bit more difficult? Fortunately, we can. We can create a custom Schema and then parse the source data to populate a Schema-based Structure. But we will come to that later.

First let us establish the Framework for our Source Connector.

Source Connector - the Framework

The starting point for our Source Connector is a pair of Java API classes: SourceConnector and SourceTask. We will put them into separate .scala source files, but they are shown here together:

import org.apache.kafka.connect.source.{SourceConnector, SourceTask}

class HttpSourceConnector extends SourceConnector {...}
class HttpSourceTask extends SourceTask {...}

These two classes will be the basis for our Source Connector implementation:

HttpSourceConnector represents the Connector process management. Each Connector process will have only one SourceConnector instance.

HttpSourceTask represents the Kafka task doing the actual data integration work. There can be one or many Tasks active for an active SourceConnector instance.

We will have some additional classes for config and for HTTP access.

But first let us look at each of the two classes in more detail.

SourceConnector class

SourceConnector is an abstract class that defines the interface our HttpSourceConnector needs to adhere to. The first function we need to override is config:

private val configDef: ConfigDef = new ConfigDef()
  .define(HttpSourceConnectorConstants.HTTP_URL_CONFIG, Type.STRING, Importance.HIGH, "Web API Access URL")
  .define(HttpSourceConnectorConstants.API_KEY_CONFIG, Type.STRING, Importance.HIGH, "Web API Access Key")
  .define(HttpSourceConnectorConstants.API_PARAMS_CONFIG, Type.STRING, Importance.HIGH, "Web API additional config parameters")
  .define(HttpSourceConnectorConstants.SERVICE_CONFIG, Type.STRING, Importance.HIGH, "Kafka Service name")
  .define(HttpSourceConnectorConstants.TOPIC_CONFIG, Type.STRING, Importance.HIGH, "Kafka Topic name")
  .define(HttpSourceConnectorConstants.POLL_INTERVAL_MS_CONFIG, Type.STRING, Importance.HIGH, "Polling interval in milliseconds")
  .define(HttpSourceConnectorConstants.TASKS_MAX_CONFIG, Type.INT, Importance.HIGH, "Kafka Connector Max Tasks")
  .define(HttpSourceConnectorConstants.CONNECTOR_CLASS, Type.STRING, Importance.HIGH, "Kafka Connector Class Name (full class path)")

override def config: ConfigDef = configDef

This validates all the required configuration parameters. We also provide a description for each parameter, which will be shown in the missing-configuration error message.

HttpSourceConnectorConstants is an object where config parameter names are defined - these configuration parameters must be provided in the connector configuration file:

object HttpSourceConnectorConstants {
  val HTTP_URL_CONFIG          = "http.url"
  val API_KEY_CONFIG           = "http.api.key"
  val API_PARAMS_CONFIG        = "http.api.params"
  val SERVICE_CONFIG           = "service.name"
  val TOPIC_CONFIG             = "topic"
  val TASKS_MAX_CONFIG         = "tasks.max"
  val CONNECTOR_CLASS          = "connector.class"

  val POLL_INTERVAL_MS_CONFIG  = "poll.interval.ms"
  val POLL_INTERVAL_MS_DEFAULT = "5000"
}
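For reference, a connector configuration file supplying these parameters might look like the following (the class path, topic name, and key are illustrative placeholders, not values from the original project):

```properties
name=weather-http-source
connector.class=com.example.connect.HttpSourceConnector
tasks.max=1
http.url=https://api.openweathermap.org/data/2.5/weather
http.api.key=YOUR_API_KEY_HERE
http.api.params=q=London,uk
service.name=openweathermap
topic=weather-topic
poll.interval.ms=5000
```

Any key missing from this file would trigger the ConfigDef validation error described above.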

Another simple function to be overridden is taskClass - for the SourceConnector class to know its corresponding SourceTask class.

override def taskClass(): Class[_ <: SourceTask] = classOf[HttpSourceTask]

The last two functions to be overridden here are start and stop. These are called upon the creation and termination of a SourceConnector instance (not a Task instance). JavaMap here is an alias for java.util.Map - a Java Map, not to be confused with the native Scala Map, which cannot be used here. (If you are a Python developer, a Map in Java/Scala is similar to a Python dictionary, but strongly typed.) The interface requires Java data structures, but that is fine - we can convert between the two. By far the biggest problem here is the assignment of the connectorConfig variable - we cannot have a functional-programming-friendly immutable value here. The variable is defined at the class level

private var connectorConfig: HttpSourceConnectorConfig = _

and is set in the start function and then referred to in the taskConfigs function further down. This does not look pretty in Scala. Hopefully somebody will write a Scala wrapper for this interface.

Because there is no logout/shutdown/sign-out required for the HTTP API, the stop function just writes a log message.

override def start(connectorProperties: JavaMap[String, String]): Unit = {
  Try(new HttpSourceConnectorConfig(connectorProperties.asScala.toMap)) match {
    case Success(cfg) => connectorConfig = cfg
    case Failure(err) => connectorLogger.error(
      s"Could not start Kafka Source Connector ${this.getClass.getName} due to error in configuration.",
      new ConnectException(err)
    )
  }
}

override def stop(): Unit = {
  connectorLogger.info(s"Stopping Kafka Source Connector ${this.getClass.getName}.")
}

HttpSourceConnectorConfig is a thin wrapper class for the configuration.

We are almost done here. The last function to be overridden is taskConfigs .

This function is in charge of producing (potentially different) configurations for different Source Tasks. In our case, there is no reason for the Source Task configurations to differ. In fact, our HTTP API will benefit little from parallelism, so, to keep things simple, we can assume the number of tasks always to be 1.

override def taskConfigs(maxTasks: Int): JavaList[JavaMap[String, String]] =
  List(connectorConfig.connectorProperties.asJava).asJava

The name of the taskConfigs function changed in Kafka version 2.1.0 - please bear that in mind when using this code with older Kafka versions.

Source Task class

In a similar manner to the Source Connector class, we implement the Source Task abstract class. It is only slightly more complex than the Connector class.

Just like for the Connector, there are start and stop functions to be overridden for the Task.

Remember the taskConfigs function from above? This is where task configuration ends up - it is passed to the Task's start function. Also, similarly to the Connector's start function, we parse the connection properties with HttpSourceTaskConfig , which is the same as HttpSourceConnectorConfig - configuration for Connector and Task in our case is the same.

We also set up the HTTP service that we are going to use in the poll function - we create an instance of the WeatherHttpService class. (Please note that start is executed only once, upon the creation of the task, and not every time a record is polled from the data source.)

override def start(connectorProperties: JavaMap[String, String]): Unit = {
  Try(new HttpSourceTaskConfig(connectorProperties.asScala.toMap)) match {
    case Success(cfg) => taskConfig = cfg
    case Failure(err) => taskLogger.error(
      s"Could not start Task ${this.getClass.getName} due to error in configuration.",
      new ConnectException(err)
    )
  }

  val apiHttpUrl: String = taskConfig.getApiHttpUrl
  val apiKey: String = taskConfig.getApiKey
  val apiParams: Map[String, String] = taskConfig.getApiParams
  val pollInterval: Long = taskConfig.getPollInterval

  taskLogger.info(s"Setting up an HTTP service for ${apiHttpUrl}...")
  Try(new WeatherHttpService(taskConfig.getTopic, taskConfig.getService, apiHttpUrl, apiKey, apiParams)) match {
    case Success(service) => sourceService = service
    case Failure(error) =>
      taskLogger.error(s"Could not establish an HTTP service to ${apiHttpUrl}")
      throw error
  }

  taskLogger.info(s"Starting to fetch from ${apiHttpUrl} each ${pollInterval}ms...")
  running = new JavaBoolean(true)
}

The Task also has the stop function. But, just like for the Connector, it does not do much, because there is no need to sign out from an HTTP API session.

Now let us see how we get the data from our HTTP API - by overriding the poll function.

The fetchRecords function uses the sourceService HTTP service initialised in the start function. sourceService's sourceRecords function requests data from the HTTP API.

override def poll(): JavaList[SourceRecord] = this.synchronized {
  if (running.get) fetchRecords
  else null
}

private def fetchRecords: JavaList[SourceRecord] = {
  taskLogger.debug("Polling new data...")

  val pollInterval = taskConfig.getPollInterval
  val startTime    = System.currentTimeMillis

  val fetchedRecords: Seq[SourceRecord] = Try(sourceService.sourceRecords) match {
    case Success(records) =>
      if (records.isEmpty) taskLogger.info(s"No data from ${taskConfig.getService}")
      else taskLogger.info(s"Got ${records.size} results for ${taskConfig.getService}")
      records
    case Failure(error: Throwable) =>
      taskLogger.error(s"Failed to fetch data for ${taskConfig.getService}: ", error)
      Seq.empty[SourceRecord]
  }

  val endTime     = System.currentTimeMillis
  val elapsedTime = endTime - startTime

  if (elapsedTime < pollInterval) Thread.sleep(pollInterval - elapsedTime)

  fetchedRecords.asJava
}

Phew - that is the interface implementation done. Now for the fun part...

Requesting data from OpenWeatherMap's API

The fun part is rather straightforward. We use the scalaj.http library to issue a very simple HTTP request and get a response.

Our WeatherHttpService implementation will have two functions:

httpServiceResponse, which will format the request and get data from the API;

sourceRecords, which will parse the schema and wrap the result within the Kafka SourceRecord class.

Please note that error handling takes place in the fetchRecords function above.

override def sourceRecords: Seq[SourceRecord] = {
  val weatherResult: HttpResponse[String] = httpServiceResponse
  logger.info(s"Http return code: ${weatherResult.code}")
  val record: Struct = schemaParser.output(weatherResult.body)

  List(
    new SourceRecord(
      Map(HttpSourceConnectorConstants.SERVICE_CONFIG -> serviceName).asJava, // partition
      Map("offset" -> "n/a").asJava,                                          // offset
      topic,
      schemaParser.schema,
      record
    )
  )
}

private def httpServiceResponse: HttpResponse[String] = {

  @tailrec
  def addRequestParam(accu: HttpRequest, paramsToAdd: List[(String, String)]): HttpRequest = paramsToAdd match {
    case (paramKey, paramVal) :: rest => addRequestParam(accu.param(paramKey, paramVal), rest)
    case Nil => accu
  }

  val baseRequest = Http(apiBaseUrl).param("APPID", apiKey)
  val request = addRequestParam(baseRequest, apiParams.toList)

  request.asString
}
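The tail-recursive addRequestParam helper is effectively a fold over the parameter list. A self-contained sketch of the same idea, using a plain query-string builder instead of scalaj's HttpRequest (object and function names are illustrative):

```scala
object QueryParamSketch {
  // Appends URL parameters to a base URL, the same way addRequestParam
  // folds params into the scalaj HttpRequest one at a time.
  def addParams(baseUrl: String, params: List[(String, String)]): String =
    params.foldLeft(baseUrl) { case (url, (k, v)) =>
      val sep = if (url.contains("?")) "&" else "?"
      s"$url$sep$k=$v"
    }

  def main(args: Array[String]): Unit =
    println(addParams("https://example.org/data", List("q" -> "London,uk", "units" -> "metric")))
    // https://example.org/data?q=London,uk&units=metric
}
```

Whether you prefer the explicit @tailrec recursion or a foldLeft is a matter of taste - the compiler turns both into a loop.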

Parsing the Schema

Now the last piece of the puzzle - our Schema parsing class.

The short version of it, which would do just fine, is just 2 lines of class (actually - object) body:

object StringSchemaParser extends KafkaSchemaParser[String, String] {
  override val schema: Schema = Schema.STRING_SCHEMA
  override def output(inputString: String) = inputString
}

Here we say we just want to use the pre-defined STRING_SCHEMA value as our schema definition. And pass inputString straight to the output, without any alteration.

Looks too easy, does it not? Schema parsing could be a big part of Source Connector implementation. Let us implement a proper schema parser. Make sure you read the Confluent Developer Guide first.

Our schema parser will be encapsulated in the WeatherSchemaParser object. KafkaSchemaParser is a trait with two type parameters - the inbound and the outbound data type. These indicate that the parser receives data in String format and produces a Kafka Struct value.

object WeatherSchemaParser extends KafkaSchemaParser[String, Struct]
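KafkaSchemaParser itself is not part of the Kafka API - it is a small abstraction of our own. A minimal self-contained sketch of what such a trait could look like (with a stand-in Schema type so the snippet compiles on its own; the real code would use org.apache.kafka.connect.data.Schema):

```scala
object SchemaParserSketch {
  // Stand-in for org.apache.kafka.connect.data.Schema
  sealed trait Schema
  case object StringSchema extends Schema

  // A parser turns raw input of type I into a schema-conformant value of type O
  trait KafkaSchemaParser[I, O] {
    val schema: Schema
    def output(input: I): O
  }

  // Mirrors StringSchemaParser: pass the input straight through
  object IdentityParser extends KafkaSchemaParser[String, String] {
    override val schema: Schema = StringSchema
    override def output(input: String): String = input
  }

  def main(args: Array[String]): Unit =
    println(IdentityParser.output("{\"name\":\"London\"}"))
}
```

Keeping the parser behind a trait means the Task code does not care whether it is handed the trivial String parser or the full weather-schema parser below.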

The first step is to create a schema value with the SchemaBuilder. Our schema is rather large, so I will skip most fields. The field names reflect the hierarchy of the source JSON. What we are aiming for is a flat, table-like structure - a likely Schema creation scenario.

For JSON parsing we will be using the Scala Circe library, which in turn is based on the Cats library. (If you are a Python developer, you will see that JSON parsing in Scala is a bit more involved - this might be an understatement - but, on the flip side, you can be sure about the result you are getting out of it.)

override val schema: Schema = SchemaBuilder.struct().name("weatherSchema")
  .field("coord-lon", Schema.FLOAT64_SCHEMA)
  .field("coord-lat", Schema.FLOAT64_SCHEMA)
  .field("weather-id", Schema.FLOAT64_SCHEMA)
  .field("weather-main", Schema.STRING_SCHEMA)
  .field("weather-description", Schema.STRING_SCHEMA)
  .field("weather-icon", Schema.STRING_SCHEMA)
  // ...
  .field("rain", Schema.FLOAT64_SCHEMA)
  // ...

Next we define case classes, into which we will be parsing the JSON content.

case class Coord(lon: Double, lat: Double)

case class WeatherAtom(id: Double, main: String, description: String, icon: String)

That is easy enough. Please note that the case class attribute names match the attribute names in the JSON one-to-one. However, our weather JSON is rather relaxed when it comes to attribute naming: it has names like type and 3h, both of which are invalid value names in Scala. What do we do? We give the attributes valid Scala names and then implement a decoder:

case class Rain(threeHours: Double)

object Rain {
  implicit val decoder: Decoder[Rain] = Decoder.instance { h =>
    for {
      threeHours <- h.get[Double]("3h")
    } yield Rain(threeHours)
  }
}

The Rain case class is rather short, with only one attribute, whose JSON name was 3h. We map '3h' to the Scala attribute threeHours.

Not quite as simple as JSON parsing in Python, is it?

In the end, we assemble all sub-case classes into the WeatherSchema case class, representing the whole result JSON.

case class WeatherSchema(
  coord: Coord,
  weather: List[WeatherAtom],
  base: String,
  mainVal: Main,
  visibility: Double,
  wind: Wind,
  clouds: Clouds,
  dt: Double,
  sys: Sys,
  id: Double,
  name: String,
  cod: Double
)

Now, the parsing itself. (Drums, please!)

structInput here is the input JSON in String format. WeatherSchema is the case class we created above. Circe's decode function returns a Scala Either: an error on the Left(), a successful parsing result on the Right() - nice and tidy. And safe.

val weatherParsed: WeatherSchema = decode[WeatherSchema](structInput) match {
  case Left(error) =>
    logger.error(s"JSON parser error: ${error}")
    emptyWeatherSchema
  case Right(weather) => weather
}
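If the Either pattern is new to you, it is plain Scala rather than anything Circe-specific. The same Left-on-failure, Right-on-success shape can be demonstrated without any libraries (names here are illustrative):

```scala
object EitherSketch {
  // Mimics circe's decode: Left carries the error, Right the parsed value.
  def parseTemp(raw: String): Either[String, Double] =
    try Right(raw.toDouble)
    catch { case _: NumberFormatException => Left(s"not a number: $raw") }

  // Fall back to a default on failure, like emptyWeatherSchema above.
  def tempOrDefault(raw: String): Double = parseTemp(raw) match {
    case Left(_)     => 0.0
    case Right(temp) => temp
  }

  def main(args: Array[String]): Unit = {
    println(tempOrDefault("280.32")) // 280.32
    println(tempOrDefault("oops"))   // 0.0
  }
}
```

The compiler forces you to handle both branches of the match, which is exactly the safety we get from Circe's decode.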

Now that we have the WeatherSchema object, we can construct our Struct object that will become part of the SourceRecord returned by the sourceRecords function in the WeatherHttpService class. That in turn is called from the HttpSourceTask's poll function that is used to populate the Kafka topic.

val weatherStruct: Struct = new Struct(schema)
  .put("coord-lon", weatherParsed.coord.lon)
  .put("coord-lat", weatherParsed.coord.lat)
  .put("weather-id", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).id)
  .put("weather-main", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).main)
  .put("weather-description", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).description)
  .put("weather-icon", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).icon)
  // ...

Done!

Considering that Schema parsing in our simple example was optional, creating a Kafka Source Connector for us meant creating a Source Connector class, a Source Task class and a Source Service class.

Creating JAR(s)

JAR creation is described in Confluent's Connector Development Guide. The guide mentions two options: either all the library dependencies are added to the target JAR file, a.k.a. an 'uber-JAR', or the dependencies are copied to the target folder. In the latter case they must all reside in the same folder, with no subfolder structure. For no particular reason, I went with the latter option.
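With sbt, one way to copy all runtime dependencies into a single flat folder is a small custom task. This is a sketch assuming sbt 1.x, with an invented task name; sbt-assembly would be the usual choice for the uber-JAR route instead:

```scala
// build.sbt fragment (sketch): copy runtime dependency JARs into one flat
// folder next to the connector JAR, with no subfolder structure.
lazy val copyDependencies = taskKey[Unit]("Copy runtime dependencies to target/libs")

copyDependencies := {
  val libs = target.value / "libs"
  IO.createDirectory(libs)
  (Runtime / dependencyClasspath).value.files
    .filter(_.isFile)
    .foreach(f => IO.copyFile(f, libs / f.getName))
}
```

Running sbt copyDependencies then leaves everything the connector needs in target/libs.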

The Developer Guide says it is important not to include the Kafka Connect API libraries there. (Instead they should be added to the CLASSPATH.) Please note that for the latest Kafka versions it is advised not to add these custom JARs to the CLASSPATH; instead, we will add them to the connector's plugin.path. But that we will leave for another blog post.

Scala - was it worth using it?

Only if you are a big fan. The code I wrote is very Java-like, and it might have been better to write it in Java. However, if somebody writes a Scala wrapper for the Connector interfaces - or, even better, if a Kafka Scala API is released - writing Connectors in Scala would be a very good choice.