Processing Time Series Data in Real-Time with InfluxDB and Structured Streaming

This article focuses on how to use the popular open source database InfluxDB along with Spark Structured Streaming to process, store, and visualize data in real time. We will go into detail on how to set up a single-node instance of InfluxDB, how to extend Spark's ForeachWriter to write to InfluxDB, and what to keep in mind while designing an InfluxDB database.

In the data world, one of the things people most want to see is how a metric progresses over time. This makes managing and handling time series data (data whose values are indexed by, and dependent on, time) a very important part of a data scientist's work.

A lot of tools and databases have been developed around the idea of handling time series data efficiently. During a recent project, I got to explore one such popular open source database, InfluxDB, and this post is about how to process real-time data with InfluxDB and Spark.

Influxdb

By way of definition, the official documentation describes it as follows:

InfluxDB is used as a data store for any use case involving large amounts of time-stamped data, including DevOps monitoring, log data, application metrics, IoT sensor data, and real-time analytics.

This article will not go into the internals of the database or the algorithms it uses; those details can be found in the official documentation.

In this article, I will focus mainly on installation, write and read performance, writing through Spark, and the behavior of InfluxDB as the volume of data grows.

Installation

InfluxDB comes in two editions: the open source edition, which can be installed only on a single instance, and the enterprise edition, which is paid and can be installed on a cluster.

For many use cases, the open source edition is sufficient. A single-instance installation of InfluxDB is very simple. The steps I followed differ from those in the documentation (which I found a bit tricky to install from), and are as follows:

1. Download the rpm file of InfluxDB.

2. Install the alien package if it is not already installed: “sudo apt-get install alien”

3. Convert the rpm to a .deb file: “alien name.rpm”

4. Install InfluxDB: “sudo dpkg -i name.deb”

5. Start the InfluxDB server with “sudo influxd” or with “sudo service influx start”

Hardware Sizing Guidelines

InfluxDB has been generous enough to provide us with hardware sizing guidelines, including a set for a single-node instance.

These guidelines are described in much more detail in the InfluxDB hardware sizing documentation.

InfluxDB Basic Concepts

There are some important InfluxDB concepts to understand here:

1. Measurement: A measurement is loosely equivalent to a table in a relational database. Data is stored inside a measurement, and a database can have multiple measurements. A measurement primarily consists of three types of columns: time, tags, and fields.

2. Time: Time is a column tracking the timestamp of each point, so that time series operations can be performed efficiently. The default is the InfluxDB server time, in nanoseconds; however, it can be replaced with the event time.

3. Tags: A tag is similar to an indexed column in a relational database. An important point to remember is that relational operations like WHERE, GROUP BY, etc. can be performed on a column only if it is marked as a tag.

4. Fields: Fields are the columns on which mathematical operations such as sum, mean, non-negative derivative, etc. can be performed. In recent versions, string values can also be stored as fields.

5. Series: A series is the most important concept in InfluxDB. A series is a combination of measurement, tag set, and retention policy (InfluxDB's default, if none is specified). An InfluxDB database's performance is highly dependent on the number of unique series it contains, which is the cardinality of the tags x the number of measurements x the number of retention policies.
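Under the hood, these pieces map directly onto InfluxDB's text-based line protocol, where a point is serialized as `measurement,tag=value field=value timestamp`. As a rough sketch (the helper below is hypothetical, written only to illustrate the mapping, and omits the escaping of spaces and commas that the real protocol requires):

```scala
// Hypothetical illustration of how measurement, tags, fields and time
// map onto InfluxDB's line protocol. Escaping of special characters
// (spaces, commas in values) is deliberately omitted for clarity.
def toLineProtocol(measurement: String,
                   tags: Map[String, String],
                   fields: Map[String, Any],
                   timeNs: Long): String = {
  // Tags are appended to the measurement name, comma-separated
  val tagPart = tags.toSeq.sorted.map { case (k, v) => s",$k=$v" }.mkString
  // String fields are quoted; numeric and boolean fields are not
  val fieldPart = fields.toSeq.sortBy(_._1).map {
    case (k, v: String) => s"""$k="$v""""
    case (k, v)         => s"$k=$v"
  }.mkString(",")
  s"$measurement$tagPart $fieldPart $timeNs"
}
```

For example, `toLineProtocol("cpu", Map("host" -> "serverA"), Map("usage" -> 0.5), 1L)` yields `cpu,host=serverA usage=0.5 1`.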

It is imperative to decide judiciously which values to store as tags and which as fields, as this determines both the kinds of operations that can be performed on each column and the performance of the database itself.
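A back-of-the-envelope estimate makes the cost of this decision concrete: the series count is the product of each tag's distinct values, multiplied across measurements and retention policies. All numbers below are made up for illustration:

```scala
// Hypothetical tag cardinalities: 100 hosts, 5 regions, 20 sensor types
val tagCardinalities = Map("host" -> 100L, "region" -> 5L, "sensor" -> 20L)
val measurements = 2L      // number of measurements in the database
val retentionPolicies = 1L // default retention policy only

// Estimated number of unique series the database has to manage:
// 100 * 5 * 20 * 2 * 1 = 20000
val seriesEstimate = tagCardinalities.values.product * measurements * retentionPolicies
```

Storing a high-cardinality identifier (say, a unique request ID) as a tag instead of a field would multiply this estimate by its number of distinct values, which is exactly how series cardinality explodes.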

Writing Data From Spark

Spark is currently the most popular and efficient open source tool in the field of big data processing. There are at present two open source implementations of an InfluxDB sink available for writing data through Structured Streaming: chronicler and reactiveinflux.

Both of these are efficient. The only problem with chronicler is that, to write data through it, one has to first convert each record into an InfluxDB line-protocol point, which can become tricky with a large number of fields and string values. For this reason alone, I preferred reactiveinflux.

To include reactiveinflux in an sbt project, add:

libraryDependencies ++= Seq(
  "com.pygmalios" % "reactiveinflux-spark_2.11" % "1.4.0.10.0.5.1",
  "com.typesafe.netty" % "netty-http-pipelining" % "1.1.4"
)

Make an entry in application.conf:



reactiveinflux {
  url = "http://localhost:8086/"
  spark {
    batchSize = 1000 // Number of records to be sent in each batch
  }
}

To enable a Structured Streaming query to write into InfluxDB, one needs to extend the ForeachWriter available in Spark Structured Streaming. Pseudo-code for this is given below:

import com.pygmalios.reactiveinflux._
import com.pygmalios.reactiveinflux.spark._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.joda.time.DateTime
import com.pygmalios.reactiveinflux.{ReactiveInfluxConfig, ReactiveInfluxDbName}
import com.pygmalios.reactiveinflux.sync.{SyncReactiveInflux, SyncReactiveInfluxDb}
import scala.concurrent.duration._

class influxDBSink(dbName: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {



  var db: SyncReactiveInfluxDb = _
  implicit val awaitAtMost = 1.second

  // Open the database connection here (called once per partition)
  def open(partitionId: Long, version: Long): Boolean = {
    val syncReactiveInflux = SyncReactiveInflux(ReactiveInfluxConfig(None))
    db = syncReactiveInflux.database(dbName)
    db.create() // create the database if it does not exist
    true
  }

  // Write the processing logic and database commit code here
  def process(value: org.apache.spark.sql.Row): Unit = {
    val point = Point(
      time = DateTime.now(), // system time; replace with an event-time column if needed
      measurement = "measurement1",
      tags = Map(
        "t1" -> "A",
        "t2" -> "B"
      ),
      fields = Map(
        "f1" -> 10.3, // BigDecimal field
        "f2" -> "x",  // String field
        "f3" -> -1L,  // Long field
        "f4" -> true  // Boolean field
      )
    )
    db.write(point)
  }

  // Close the connection here
  def close(errorOrNull: Throwable): Unit = {}
}

and then include it in the writer as follows:

val influxWriter = new influxDBSink("dbName")

val influxQuery = ifIndicatorData
  .writeStream
  .foreach(influxWriter)
  .outputMode("append")
  .start()

Visualization

Once the data is stored, visualizations can be built using various tools such as Grafana, Chronograf, etc.


There are many articles available on Medium and other platforms about building such visualizations, so I will not cover the topic in detail here.

Conclusion

In conclusion, I found InfluxDB to be highly efficient at storing data and very easy to use. The compaction algorithms of InfluxDB are very powerful and compress data to almost half its original size; with my own data, I have seen compression reduce around 67 GB to 35 GB.
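For reference, the reduction quoted above works out to roughly 48%:

```scala
// Compression observed on my dataset (figures from the text above)
val rawGB = 67.0
val compressedGB = 35.0
val reductionPct = (1.0 - compressedGB / rawGB) * 100 // roughly 47.8% smaller on disk
```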

However, what exactly determines the scale and effect of compression is beyond the scope of this article.