Realtime predictions with Apache Spark/Pyspark and Python

There are many blogs that talk about Apache spark and how scalable it is to build Machine Learning models using Big data. But, there are few blogs that talk about Apache Spark and real time predictions and in this post we will be talking about real time predictions.

What is Apache Spark?

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It provides a scalable way run computions in a distributed environment.

Supports Scalable Data Processing (RDD, DataFrames)

Supports building Machine Learning Model through MLLib

Supports Streaming workload through Spark Streaming

Supports Graph Processing using Graphx

Importance of Realtime predictions

Apache spark is a great when it comes to data munging, distributed data processing, building models and batch inference. Since the library needs to supports cluster management, worker co-ordination in a distributed system it comes with an overhead and not a good fit for including in the realtime prediction pipeline.

Realtime Predictions use case

Fraud: Let say in realtime we have a use case to detect if card swipe or the order was fraudulent. For this feature to be useful we are talking about sub-millisecond latency.

Recommendations: Ad click predictions

For the sake of simplicity, we will be using python and build a simple GBTRegressor where the training in done in Pyspark and the predictions are done in Realtime using Mleap

What is Mleap?

Mleap is a library that allows real-time workloads for a model built in Apache Spark. Think of it as Apache Spark without all the cluster management and worker coordination overhead.

Mleap provides a one-to-one mapping between the Transformer and Estimator provided by Apache Spark. So if you can build a model using existing transformers you can export the model and do real-time predictions.

We will be working on the diamonds dataset and try to predict the price of the diamond. However, we will not concentrate on the model accuracy or anything just the part of Build -> Export -> Real-time Predict.

Diamond Dataset

We need to add mleap packages to pyspark so that we can export the model and the pipeline as a mleap bundle

pyspark --packages ml.combust.mleap:mleap-spark_2.11:0.10.0 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.3.1 /_/ Using Python version 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018 19:50:54) SparkSession available as 'spark'. from pyspark.ml.feature import * from pyspark.ml import Pipeline from pyspark.ml.regression import * from pyspark.ml.evaluation import * import mleap.pyspark from mleap.pyspark.spark_support import SimpleSparkSerializer path = 'diamonds.csv' allFields = spark.read.option("header", True).option("inferSchema", True).csv(path).drop("_c0") 2018-10-19 14:58:35 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException categoricalFields = ["cut", "color", "clarity"] indexers = [StringIndexer(inputCol=f, outputCol=f + "Index") for f in categoricalFields] assembler = VectorAssembler( \ ... inputCols=[f + "Index" for f in categoricalFields] + ["carat", "depth", "table", "x", "y", "z"], \ ... outputCol="features") gbt = GBTRegressor(labelCol="price") gbtPipeline = Pipeline(stages=indexers + [assembler, gbt]) train, test = allFields.randomSplit([0.8, 0.2]) gbtModel = gbtPipeline.fit(train) predictions = gbtModel.transform(test) predictions.show(2) 2018-10-19 14:58:59 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 2018-10-19 14:58:59 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS +-----+-----+-----+-------+-----+-----+-----+----+----+----+--------+----------+------------+--------------------+-----------------+ |carat| cut|color|clarity|depth|table|price| x| y| z|cutIndex|colorIndex|clarityIndex| features| prediction| +-----+-----+-----+-------+-----+-----+-----+----+----+----+--------+----------+------------+--------------------+-----------------+ | 0.2|Ideal| D| VS2| 61.5| 57.0| 367|3.81|3.77|2.33| 0.0| 4.0| 1.0|[0.0,4.0,1.0,0.2,...|717.4607731780242| | 0.2|Ideal| E| VS2| 59.7| 55.0| 367|3.86|3.84| 2.3| 0.0| 1.0| 1.0|[0.0,1.0,1.0,0.2,...|688.2446556306745| +-----+-----+-----+-------+-----+-----+-----+----+----+----+--------+----------+------------+--------------------+-----------------+ only showing top 2 rows

Now that we have the model its time to serialize and export the model

gbtModel.serializeToBundle("jar:file:/tmp/new1_diamonds.zip", predictions.limit(0)) # Generate a sample dataset to test against real time predictions small_test = test.drop("price").sample(False, 0.002) schema = [ { "name" : field.simpleString().split(":")[0], "type" : field.simpleString().split(":")[1] } for field in small_test.schema ] rows = [[field for field in row] for row in small_test.collect()] rows [[0.29, 'Very Good', 'E', 'VS1', 62.8, 44.0, 4.2, 4.24, 2.65], [0.3, 'Premium', 'H', 'VS1', 62.5, 58.0, 4.34, 4.27, 2.69], [0.37, 'Very Good', 'E', 'VS1', 62.0, 56.0, 4.59, 4.63, 2.86] ..... ]

Using the data above lets generate a sample payload for real time predictions. One key point to note is that we need to pass the necessary schema to mleap runtime to make it work.

import json leapframe = { "schema" : { "fields" : schema }, "rows" : rows } json.dumps(leapframe) '{"schema": {"fields": [{"name": "carat", "type": "double"}, {"name": "cut", "type": "string"}, {"name": "color", "type": "string"}, {"name": "clarity", "type": "string"}, {"name": "depth", "type": "double"}, {"name": "table", "type": "double"}, {"name": "x", "type": "double"}, {"name": "y", "type": "double"}, {"name": "z", "type": "double"}]}, "rows": [[0.29, "Very Good", "E", "VS1", 62.8, 44.0, 4.2, 4.24, 2.65], [0.3, "Premium", "H", "VS1", 62.5, 58.0, 4.34, 4.27, 2.69], [0.37, "Very Good", "E", "VS1", 62.0, 56.0, 4.59, 4.63, 2.86], [0.42, "Very Good", "F", "VS1", 59.9, 61.0, 4.84, 4.88, 2.91], [0.5, "Premium", "G", "SI1", 62.9, 59.0, 5.09, 5.06, 3.19], [0.51, "Ideal", "F", "VS1", 61.7, 55.0, 5.14, 5.17, 3.18], [0.51, "Very Good", "G", "VS1", 62.7, 56.0, 5.08, 5.13, 3.2], [0.52, "Very Good", "D", "VVS2", 59.8, 55.0, 5.28, 5.35, 3.18], [0.55, "Ideal", "I", "VVS1", 61.7, 56.0, 5.24, 5.29, 3.25], [0.6, "Very Good", "G", "SI1", 63.3, 58.0, 5.34, 5.31, 3.37], [0.65, "Premium", "F", "VS2", 59.0, 62.0, 5.68, 5.65, 3.34], [0.71, "Ideal", "J", "SI1", 61.0, 57.0, 5.78, 5.8, 3.53], [0.71, "Very Good", "E", "SI2", 62.2, 58.0, 5.68, 5.74, 3.55], [0.72, "Good", "F", "SI1", 63.6, 55.0, 5.71, 5.64, 3.61], [0.72, "Ideal", "J", "SI1", 62.6, 56.0, 5.67, 5.74, 3.57], [0.73, "Ideal", "G", "VVS2", 61.9, 56.0, 5.74, 5.81, 3.61], [1.0, "Good", "J", "SI2", 57.8, 61.0, 6.54, 6.58, 3.79], [1.01, "Good", "F", "SI2", 63.6, 59.0, 6.37, 6.34, 4.04], [1.01, "Premium", "D", "SI1", 62.2, 53.0, 6.47, 6.43, 4.01], [1.01, "Very Good", "G", "SI1", 63.5, 60.0, 6.41, 6.38, 3.92], [1.14, "Very Good", "G", "SI2", 59.5, 59.0, 6.79, 6.85, 4.06], [1.22, "Premium", "I", "SI1", 62.1, 58.0, 6.89, 6.8, 4.25], [1.7, "Premium", "I", "VS2", 60.5, 61.0, 7.68, 7.65, 4.64], [2.05, "Premium", "H", "SI1", 60.2, 58.0, 8.25, 8.22, 4.96]]}'

Store the above serialized data into `data.json` file

Docker & Mleap Runtime.

To generate real time predictions we need to load our trained model into the runtime provided by Mleap. Make sure you have installed docker locally.