```scala
import org.apache.spark.sql.DataFrame

val amaznDf: DataFrame = spark
  .read
  .option("header", "true")
  .csv("/tmp/amzn.csv")
val amazn = amaznDf.select(amaznDf("Date").as("amznDate"), amaznDf("Close").as("closeAmazn"))

val googDf: DataFrame = spark
  .read
  .option("header", "true")
  .csv("/tmp/goog.csv")
val goog = googDf.select(googDf("Date").as("googDate"), googDf("Close").as("closeGoog"))

val yhooDf: DataFrame = spark
  .read
  .option("header", "true")
  .csv("/tmp/yhoo.csv")
val yhoo = yhooDf.select(yhooDf("Date").as("yhooDate"), yhooDf("Close").as("closeYhoo"))
```

```scala
import spark.implicits._

// Inner-join the three DataFrames on date so each row carries all three closing prices.
val data = amazn
  .join(goog, $"amznDate" === $"googDate")
  .select($"amznDate", $"closeAmazn", $"closeGoog")
  .join(yhoo, $"amznDate" === $"yhooDate")
  .select($"amznDate".as("date"), $"closeAmazn", $"closeGoog", $"closeYhoo")
```





Since we need to create a time series out of the above DataFrame, we first reshape it into the desired long format: one row per (date, symbol) pair.





The benefit of converting the data to this format is that we can always add more symbols and the structure will still remain the same.

Here, date is our timestamp, symbol is the key, and price is the value.
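The same wide-to-long reshaping can be sketched on plain Scala collections, independent of Spark (the sample values below are hypothetical, not taken from the actual CSVs):

```scala
// Hypothetical sample rows in the wide format: (date, closeAmazn, closeGoog, closeYhoo)
val wide = List(
  ("2016-01-04", "636.99", "741.84", "31.40"),
  ("2016-01-05", "633.79", "742.58", "32.20")
)

// Turn each wide row into three (date, symbol, price) observations,
// mirroring the flatMap applied to the DataFrame later on.
val long = wide.flatMap { case (date, amzn, goog, yhoo) =>
  List((date, "amzn", amzn), (date, "goog", goog), (date, "yhoo", yhoo))
}
```

Note that adding a fourth symbol only adds more rows, never more columns, which is why the structure stays stable.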





Below is the code to convert the DataFrame obtained in step 2 into the desired form.





```scala
import org.apache.spark.sql.types.DoubleType
import spark.implicits._

// Reshape each wide row into three (date, symbol, closingPrice) rows.
val formattedData = data
  .flatMap { row =>
    Array(
      (row.getString(row.fieldIndex("date")), "amzn", row.getString(row.fieldIndex("closeAmazn"))),
      (row.getString(row.fieldIndex("date")), "goog", row.getString(row.fieldIndex("closeGoog"))),
      (row.getString(row.fieldIndex("date")), "yhoo", row.getString(row.fieldIndex("closeYhoo")))
    )
  }.toDF("date", "symbol", "closingPrice")

// toTime is a UDF (defined elsewhere) that parses the date string into a timestamp.
val finalDf = formattedData
  .withColumn("timestamp", toTime(formattedData("date")))
  .withColumn("price", formattedData("closingPrice").cast(DoubleType))
  .drop("date", "closingPrice")
  .sort("timestamp")
finalDf.registerTempTable("preData")
```
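The toTime UDF is not shown in the snippet above. A minimal sketch of the parsing logic it might wrap, assuming the Date column uses Yahoo Finance's yyyy-MM-dd format (the name parseDate and the UDF wrapping are illustrative, not from the original):

```scala
import java.sql.Timestamp
import java.time.LocalDate

// Parse a "yyyy-MM-dd" date string into a Timestamp at midnight.
// In the Spark job this would be wrapped in a UDF, e.g.:
//   val toTime = udf((s: String) => parseDate(s))
def parseDate(s: String): Timestamp =
  Timestamp.valueOf(LocalDate.parse(s).atStartOfDay())
```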

```scala
import java.time.{ZonedDateTime, ZoneId}
import com.cloudera.sparkts.{DateTimeIndex, DayFrequency, TimeSeriesRDD}

// Find the date range covered by the data.
val minDate = finalDf.selectExpr("min(timestamp)").collect()(0).getTimestamp(0)
val maxDate = finalDf.selectExpr("max(timestamp)").collect()(0).getTimestamp(0)

val zone = ZoneId.systemDefault()

// Build a uniform index with one entry per day between minDate and maxDate.
val dtIndex = DateTimeIndex.uniformFromInterval(
  ZonedDateTime.of(minDate.toLocalDateTime, zone),
  ZonedDateTime.of(maxDate.toLocalDateTime, zone),
  new DayFrequency(1)
)

// One time series per symbol, aligned on the daily index.
val tsRdd = TimeSeriesRDD.timeSeriesRDDFromObservations(dtIndex, finalDf, "timestamp", "symbol", "price")
```
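Conceptually, the uniform daily index simply enumerates one entry per day between the minimum and maximum timestamps. A plain java.time sketch of that enumeration, independent of spark-ts (the helper name dailyIndex is illustrative):

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Enumerate every date from start to end inclusive, one per day --
// the same shape of index that DayFrequency(1) produces.
def dailyIndex(start: LocalDate, end: LocalDate): Seq[LocalDate] = {
  val days = ChronoUnit.DAYS.between(start, end).toInt
  (0 to days).map(i => start.plusDays(i.toLong))
}
```

Days on which a symbol has no observation (e.g. weekends) end up as NaN in the aligned series, which is why the forecasting step below fills missing values first.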

```scala
import com.cloudera.sparkts.models.ARIMA

val DAYS = 5 // number of days to forecast

val df = tsRdd.mapSeries { vector =>
  // Replace missing values (NaN) with 0 before fitting.
  val newVec = new org.apache.spark.mllib.linalg.DenseVector(vector.toArray.map(x => if (x.isNaN) 0.0 else x))
  // Fit an ARIMA(1, 0, 0) model and forecast DAYS values beyond the series.
  val arimaModel = ARIMA.fitModel(1, 0, 0, newVec)
  val forecasted = arimaModel.forecast(newVec, DAYS)
  // forecast returns the original series followed by the DAYS forecasts;
  // keep only the last DAYS values.
  new org.apache.spark.mllib.linalg.DenseVector(forecasted.toArray.slice(forecasted.size - DAYS, forecasted.size))
}.toDF("symbol", "values")
df.registerTempTable("data")
```
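The slicing at the end relies on the forecast output having the original series first and the DAYS forecasted values last. On a plain array, the extraction looks like this (the sample lengths and values are illustrative, assuming that return shape):

```scala
val DAYS = 5

// Suppose the model was fit on 10 observations; the forecast output is then
// assumed to be those 10 values followed by the 5 forecasts (15 in total).
val forecasted = Array.tabulate(15)(_.toDouble)

// Take the last DAYS entries, i.e. the forecasts themselves.
val lastForecasts = forecasted.slice(forecasted.length - DAYS, forecasted.length)
```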

Apache Spark has become one of the most powerful frameworks for big data processing because of its in-memory computing capabilities. Many organisations are moving their ETL workflows as well as their machine learning workflows to the Apache Spark engine. But not much light has been shed on time-series processing with Apache Spark. In this post we will see how we can make use of the spark-ts package for time-series forecasting.

The problem I have chosen is stock price prediction, as it is generic and easy for beginners to understand. The problem is straightforward: we have the closing stock prices for Google (goog), Yahoo (yhoo) and Amazon (amzn) for the last year, and we want to predict the price for a certain number of days. The data is freely available and was downloaded from Yahoo Finance in the form of CSV files. Let's get started.

1. First, we will create the DataFrames for each of the CSV files we have downloaded.

2. Next, let's join this data on date and create one DataFrame.

3. Now we have the data in the format shown above. We can plot price against timestamp to see how the stock price has varied for a particular symbol. Below is the image showing the same for Amazon (amzn).

4. Since we have the data ready in the desired format, we will now make use of the spark-ts package for forecasting. Before applying any forecasting algorithm to this data we need to create a TimeSeriesRDD. A TimeSeriesRDD is like any other Spark RDD, with a specific schema, on which we can perform operations in a distributed fashion. The spark-ts package provides the TimeSeriesRDD object, with which we can build our TimeSeriesRDD from the above DataFrame as shown in the code.

5. Next, we will use ARIMA to forecast the price for the next 5 days. The spark-ts package provides various algorithms for time-series modelling, including AR, MA, ARIMA and Holt-Winters. Below is our forecasted DataFrame.

You can go through this blog to learn more about the spark-ts package. In this post we focussed on how we can leverage Apache Spark's computing abilities for time-series analysis. Please share your thoughts and questions via comments.