When using Spark Streaming, one can often run into a situation where Spark doesn’t support the data source you are trying to integrate. For such situations, Spark provides a Receiver class that can be run on worker nodes to receive external data. This requires the developer to implement a receiver customized for the data source in question.

This starts with implementing a Receiver (Scala doc, Java doc). A custom receiver must extend this abstract class and implement two methods:

onStart(): Things to do to start receiving data.

onStop(): Things to do to stop receiving data.

Once the data is received, it can be stored inside Spark by calling store(data), a method provided by the Receiver class. There are a number of flavors of store() which allow one to store the received data record-at-a-time or as a whole collection of objects / serialized bytes.
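For illustration, the main store() flavors look roughly like this (a sketch, assuming a Receiver[String]; serializedBytes is a hypothetical Array[Byte], and ArrayBuffer / ByteBuffer come from scala.collection.mutable and java.nio respectively):

store("a single record")                             // store one record at a time
store(ArrayBuffer("a", "whole", "collection"))       // store a collection of records
store(Iterator("records", "from", "an", "iterator")) // store an iterator of records
store(ByteBuffer.wrap(serializedBytes))              // store pre-serialized bytes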

import java.net.ConnectException

import scala.collection.JavaConverters._
import scala.util.{Failure, Success, Try}

import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SparkKafkaReceiver(topic: String, kafkaParams: Map[String, Object])
  extends Receiver[ConsumerRecord[String, String]](StorageLevel.MEMORY_AND_DISK_2)

This class extends Spark’s Receiver class with the type parameter ConsumerRecord[String, String] and passes a StorageLevel to the constructor. StorageLevel is a singleton object containing constants for the commonly used storage levels.
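For reference, a few of the constants StorageLevel provides (MEMORY_AND_DISK_2 is the one used above):

StorageLevel.MEMORY_ONLY       // keep blocks deserialized in memory only
StorageLevel.MEMORY_AND_DISK   // spill blocks to disk when memory is full
StorageLevel.MEMORY_AND_DISK_2 // same, but replicated to two nodes for fault tolerance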

override def onStart(): Unit = {
  // Spawn a thread so that onStart() itself does not block
  new Thread("Custom Receiver") {
    override def run(): Unit = {
      Try {
        val consumer = new KafkaConsumer[String, String](kafkaParams.asJava)
        consumer.subscribe(List(topic).asJava)
        while (!isStopped) {
          val records = consumer.poll(1000L)
          records.asScala.foreach(store)
        }
        consumer.close()
      } match {
        case Failure(e: ConnectException) => restart("Error connecting to...", e)
        case Failure(t)                   => restart("Error receiving data", t)
        case Success(_)                   => // the receiver was asked to stop; exit cleanly
      }
    }
  }.start()
}

override def onStop(): Unit = {
  // Nothing to do here: the receiving thread checks isStopped() and exits on its own
}

Both onStart() and onStop() must not block indefinitely. Typically, onStart() would start the threads that are responsible for receiving the data. The receiving threads can also use isStopped(), a Receiver method, to check whether they should stop receiving data.

Any exception in the receiving threads should be caught and handled properly to avoid silent failures of the receiver. restart(<exception>) will restart the receiver by asynchronously calling onStop() and then calling onStart() after a delay. stop(<exception>) will call onStop() and terminate the receiver.
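A sketch of how these hooks might be used inside a receiving thread; reportError(), also provided by Receiver, reports a problem to the driver without stopping or restarting:

try {
  // receive data and call store(...)
} catch {
  case e: ConnectException => restart("Error connecting, will retry", e) // async onStop(), then onStart() after a delay
  case t: Throwable        => stop("Unrecoverable error", t)             // onStop(), then the receiver terminates
}
// For non-fatal problems, report without stopping:
// reportError("Something went wrong but receiving continues", e)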

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkCustomReceiver extends App {

  def getKafkaParams: Map[String, Object] = Map[String, Object](
    "auto.offset.reset" -> "earliest",
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "group3"
  )

  val properties = new Properties()
  properties.put("bootstrap.servers", "localhost:9092")
  properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val topic1 = "topic1"
  val topic2 = "topic2"

  val spark: SparkSession = SparkSession.builder.appName("Custom Receiver").master("local[*]").getOrCreate()
  val ssc: StreamingContext = new StreamingContext(spark.sparkContext, Seconds(10L))

  val stream = ssc.receiverStream(new SparkKafkaReceiver(topic1, getKafkaParams))

  // Forward every received record to topic2. The producer is created inside
  // foreachPartition because KafkaProducer is not serializable, so it must not
  // be captured in a closure that gets shipped to the executors.
  stream.foreachRDD { rdd =>
    val props = properties // local copy so the closure does not drag in the enclosing object
    rdd.foreachPartition { records =>
      val producer = new KafkaProducer[String, String](props)
      records.foreach(r => producer.send(new ProducerRecord[String, String](topic2, r.key, r.value)))
      producer.close()
    }
  }

  stream.print()

  ssc.start()
  ssc.awaitTermination()
}

getKafkaParams and properties configure the KafkaConsumer and the KafkaProducer, which receive data from and send data to different Kafka topics.

The SparkSession provides a single point of entry for interacting with the underlying Spark functionality, and here it is also used to create the StreamingContext; alternatively, one can create the StreamingContext from a SparkConf.
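For example, the SparkConf route would look like this (a sketch; the settings mirror the ones used above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build the StreamingContext directly from a SparkConf instead of a SparkSession
val conf = new SparkConf().setAppName("Custom Receiver").setMaster("local[*]")
val ssc  = new StreamingContext(conf, Seconds(10L))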

ssc.receiverStream creates an input stream from an arbitrary user-implemented receiver; in this case, the SparkKafkaReceiver class.

After creating the ReceiverInputDStream[ConsumerRecord[String, String]], each received record is forwarded to the Kafka topic topic2 via a KafkaProducer. Two details are worth noting here: DStream transformations such as map are lazy and only execute when an output operation runs on the resulting stream, and KafkaProducer is not serializable, so the producer is created inside foreachPartition on the executors rather than captured in a closure on the driver.

Finally, ssc.start() kicks off the streaming job, and ssc.awaitTermination() keeps it running until it is terminated.
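If a clean shutdown is needed, one option is a shutdown hook that stops the context gracefully, letting in-flight batches finish (a sketch):

sys.addShutdownHook {
  // Stop the StreamingContext and the underlying SparkContext, waiting for
  // the data already received to be fully processed
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}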