Multivariate time series forecasting is one of the most commonly encountered problems with various applications such as weather forecasting, stock price prediction, real estate predictions. Recently, deep learning techniques have been applied to solve this class of problems. In this blog, I will show how Apache MXNet R package (MXNet-R) can be used can be used to model and solve time series forecasting problems. MXNet-R is a binding of Apache MXNet deep learning back-end with the R language as a front-end. MXNet-R allows R enthusiasts to leverage deep learning in their research projects and applications, staying within the R eco-system, while leveraging the powerful Apache MXNet deep learning framework.

In this blog, we will train a model to learn a time series (linear combination of sine waves) to model a multi-dimensional synthetic dataset. With this dataset we will demonstrate how you can use MXNet-R to train on the data to create a model, and then use the model to forecast time series. The code below has been tested on MacOS, running on CPU with MXNet 1.3.0.

To get started, visit the MXNet install page and install the latest MXNet-R package based on your system configuration. Currently, MXNet-R supports Linux, Windows and OSX.

The next step requires to create synthetic datasets and then we will discuss steps required to achieve our goal.

Full code for the task can be found in gist.

Dataset Creation

We will create a dataset where each dimension will be a time series sine wave and the output will be linear combination of these six time series data. We will try to predict the output based on previous time series data for the output value.

Time series for each dimension

Pre-processing steps

We start with loading R libraries required for the task, followed by cleaning and normalizing the data. Please install the required packages, if not already installed, using the R console.

library("readr")

library("dplyr")

library("mxnet")

library("abind")

Now we will start with creating the synthetic datasets. We will create six sine wave time series and the seventh time series will be linear combination of previous six to create a multidimensional dataset where each of the previous sine wave serves as one dimension.

# generating datasets

len<-44000

df <- data.frame(dim1= numeric(len),dim2= numeric(len),dim3= numeric(len),dim4= numeric(len),dim5= numeric(len),dim6= numeric(len),dim7= numeric(len))

df$dim1 <- sin(pi/12 * (1:(len)))

df$dim2 <- sin(5 + pi/6 * (1:(len)))

df$dim3 <- sin(15 + pi/8 * (1:(len)))

df$dim4 <- sin(55 + pi/10 * (1:(len)))

df$dim5 <- sin(2 + pi/16 * (1:(len)))

df$dim6 <- sin(1+ pi/7 * (1:(len)))

df$dim7 <- 7*df$dim1+ 5*df$dim2+ 17*df$dim3+ 25*df$dim4 + 19*df$dim5+ 21*df$dim6

We also normalize each of the feature set to a range [0,1] as normalizing helps in getting rid of noise and network learns better.

# Normalizing features

df <- matrix(as.matrix(df),

ncol = ncol(df),

dimnames = NULL) rangenorm <- function(x) {

(x - min(x))/(max(x) - min(x))

}

df <- apply(df, 2, rangenorm)

df <- t(df)

For using multi-dimensional data with MXNet-R, we need to convert training data to the form n_dim x seq_len x num_samples. For one-to-one RNN flavors, labels should be of the form seq_len x num_samples, while for many-to-one flavor, the labels should be of the form (1 x num_samples). Please note that MXNet-R currently supports only these two flavors of RNN. One-to-one RNNs has one output at each timestep i.e. length of input sequence and output sequence are equal (ex. Part of Speech tagging) whereas many-to-one RNNs have one output for a sequence (ex. sentiment analysis for a sequence of words).

Different flavors of RNNs supported by MXNet-R [Diagram Source]

We will use n_dim = 7, seq_len = 100, and num_samples = 430 because the dataset has 430 samples, each the length of 100 timestamps. We have 7 time series as input features so each input has dimension of seven at each time step.

n_dim <- 7

seq_len <- 100

num_samples <- 430

The next step shows how to format data and labels in the form required by the RNNs in MXNet-R. We will extract only required data and the labels.

# extract only required data from dataset

trX <- df[1:n_dim, 1:(seq_len * num_samples)] # the label data(next output value) should be one time

# step ahead of the current output value

trY <- df[7, 2:(1+(seq_len * num_samples))]

We will reshape the matrices to convert into the input form required by the RNN network.

trainX <- trX

dim(trainX) <- c(n_dim, seq_len, num_samples)

trainY <- trY

dim(trainY) <- c(seq_len, num_samples)

Training the network

Once we have the formatted dataset, we will create the neural network, using symbolic programming, and will then train the newly created network.

We start with setting up samples. We will take the first 300 samples for training and the next 100 samples for evaluation. We will also keep aside few samples for testing purposes.

batch.size <- 32

train_ids <- 1:300

eval_ids <- 301:400

The next step is to create data iterators for feeding into the RNN network. Data iterators in MXNet are iterator objects. In MXNet, data iterators return a batch of data as DataBatch on each call to next . A DataBatch often contains n training examples and their corresponding labels.

## create data iterators

train.data <- mx.io.arrayiter(data = trainX[,,train_ids, drop = F], label = trainY[, train_ids], batch.size = batch.size, shuffle = TRUE)



eval.data <- mx.io.arrayiter(data = trainX[,,eval_ids, drop = F], label = trainY[, eval_ids], batch.size = batch.size, shuffle = FALSE)

Now we will create the neural network using RNN symbols. We will set the number of RNN layers to be 2 and the number of hidden units to be 50 since the dataset at hand is quite small and simple and we don’t need a very complex model to learn it. We will use LSTM units to learn the sequences. Long Short-Term Memory (LSTMs) are RNNs used for learning long temporal sequences.

## Create the symbol for RNN

symbol <- rnn.graph(num_rnn_layer = 2,

num_hidden = 50,

input_size = NULL,

num_embed = NULL,

num_decode = 1,

masking = F,

loss_output = "linear",

dropout = 0.5,

ignore_label = -1,

cell_type = "lstm",

output_last_state = T,

config = "one-to-one")

Next step involves setting up the loss function. We will use mean squared error as our metric for loss and we can define it as shown below.

mx.metric.mse.seq <- mx.metric.custom("MSE", function(label, pred) {

label = mx.nd.reshape(label, shape = -1)

pred = mx.nd.reshape(pred, shape = -1)

res <- mx.nd.mean(mx.nd.square(label - pred))

return(as.array(res))

})

We will set up context, optimizer and initializer for the network and custom callback functions for each epoch. We used CPU as we were running on Mac. However, you MXNet-R also has GPU support for Ubuntu and Windows.

ctx <- mx.cpu() initializer <- mx.init.Xavier(rnd_type = "gaussian",

factor_type = "avg",

magnitude = 1) optimizer <- mx.opt.create("adadelta",

rho = 0.9,

eps = 1e-06,

wd = 1e-06,

clip_gradient = 1,

rescale.grad = 1/batch.size) logger <- mx.metric.logger()

epoch.end.callback <- mx.callback.log.train.metric(period = 10,

logger = logger)

Now, we are all set to start training the network. Training may take a while, depending on the hardware you are running on. It took me 234 seconds running on MacBook Pro with 2.3 GHz Intel Core i5.

## train the network

system.time(model <- mx.model.buckets(symbol = symbol,

train.data = train.data,

eval.data = eval.data,

num.round = 200,

ctx = ctx,

verbose = TRUE,

metric = mx.metric.mse.seq,

initializer = initializer,

optimizer = optimizer,

batch.end.callback = NULL,

epoch.end.callback= epoch.end.callback))

The output of the above code:

Start training with 1 devices

[1] Train-MSE=0.0942884422838688

[1] Validation-MSE=0.0346374846994877

[2] Train-MSE=0.0371618691831827

[2] Validation-MSE=0.0358746200799942

..........

..........

[199] Train-MSE=0.00409387513063848

[199] Validation-MSE=0.00288719875970855

[200] Train-MSE=0.00403149246703833

[200] Validation-MSE=0.00306809868197888

user system elapsed

234.265 20.535 134.364

Training vs Validation error

From the above plot, we can see how training as well as validation error decreased gradually and network was able to learn the time-series.

Predicting future values

Now we will use the trained network for inference.

We start with extracting state symbols for the trained model.

## Extracting the state symbols for RNN



internals <- model$symbol$get.internals() sym_state <- internals$get.output(which(internals$outputs %in% "RNN_state")) sym_state_cell <- internals$get.output(which(internals$outputs %in% "RNN_state_cell")) sym_output <- internals$get.output(which(internals$outputs %in% "loss_output")) symbol <- mx.symbol.Group(sym_output, sym_state, sym_state_cell)





We will try to predict the timeseries values of our first test sample, the 401st sample.

pred_length <- 100

predicted <- numeric()

We need to initialize the states of the RNN. Hence, we will pass the 400th sample through the network to get the values for RNN states and use it to start the prediction of the next 100 timestamps.

data <- mx.nd.array(trainX[, , 400, drop = F])

label <- mx.nd.array(trainY[, 400, drop = F])

We create data iterators for the input, please note that the label is required to create the iterator and will not be used in the inference. You can use dummy values too in the label.

infer.data <- mx.io.arrayiter(data = data,

label = label,

batch.size = 1,

shuffle = FALSE) infer <- mx.infer.rnn.one(infer.data = infer.data,

symbol = symbol,

arg.params = model$arg.params,

aux.params = model$aux.params,

input.params = NULL,

ctx = ctx)

We will now iterate time-step by timestep to generate each pollution value, starting at the beginning of the 401st sample. Here we use the RNN state values generated from the ground truth data of the previous timestep to do the next time-step prediction. The other way to predict the next time-step is by using the predicted values as inputs. This is known as “auto-regressive inference”, as we will see after this example.

actual <- trainY[, 401] ## Iterate one by one over timestamps

for (i in 1:pred_length) { data <- mx.nd.array(trainX[, i, 401, drop = F])

label <- mx.nd.array(trainY[i, 401, drop = F])

infer.data <- mx.io.arrayiter(data = data,

label = label,

batch.size = 1,

shuffle = FALSE)

## use previous RNN state values

infer <- mx.infer.rnn.one(infer.data = infer.data,

symbol = symbol,

ctx = ctx,

arg.params = model$arg.params,

aux.params = model$aux.params,

input.params =

list(rnn.state=infer[[2]],

rnn.state.cell = infer[[3]])) pred <- infer[[1]]

predicted <- c(predicted, as.numeric(as.array(pred))) }

predicted contains the predicted values whereas actual contains actual values.

Let’s try to visualize the prediction with respect to actual values. We generated 401st sample as discussed above and the results are given below.

Predicted and actual values for 401st time sample

The model is quite simple yet learns the the time series data pretty well.

We can also repeat the above experiment for the 401st time sample with auto-regressive inference. In this example, we will use the output from one time-step as input for next time-step. The initial setup for getting RN states would be same as above.

## setting up RNN states

pred_length <- 100

predicted <- numeric()

data <- mx.nd.array(trainX[, , 400, drop = F])

label <- mx.nd.array(trainY[, 400, drop = F])

infer.data <- mx.io.arrayiter(data = data,

label = label,

batch.size = 1,

shuffle = FALSE) infer <- mx.infer.rnn.one(infer.data = infer.data,

symbol = symbol,

arg.params = model$arg.params,

aux.params = model$aux.params,

input.params = NULL,

ctx = ctx)

However, the inference loop can be changed appropriately for the task as shown below.

pred <- as.numeric(as.array(infer[[1]])[1,100])

actual <- trainY[, 401] ## Iterate one by one over timestamps for (i in 1:pred_length) { data_auto <- mx.nd.array(trainX[, i, 401, drop = F])

m <- as.array(data_auto)

m[1,,] <- as.numeric(as.array(pred))

data_auto <- mx.nd.array(m)

label <- mx.nd.array(trainY[i, 401, drop = F])

infer.data <- mx.io.arrayiter(data = data_auto,

label = label,

batch.size = 1,

shuffle = FALSE)

## use previous RNN state values

infer <- mx.infer.rnn.one(infer.data = infer.data,

symbol = symbol,

ctx = ctx,

arg.params = model$arg.params,

aux.params = model$aux.params,

input.params =

list(rnn.state=infer[[2]],

rnn.state.cell = infer[[3]])) pred <- infer[[1]]

predicted <- c(predicted, as.numeric(as.array(pred))) }

The inference result looks as follows.

Predicted and actual values for 401st time sample using auto-regressive inference method

The auto-regressive inference does deviate from the actual predictions as expected but the model learns the seasonality and correlation with the covariate for the 100 next time steps, not bad!

You can improve further with more tweaking of hyper-parameters, and the network architecture using more complex state-of-the-art models like DeepAR and LSTNet. You can find the entire code for the above task in gist.

Call to Action

There’s a lot more you can do with MXNet and MXNet-R — head out to the tutorials to check it out! MXNet is a powerful deep learning framework with which you can save the trained models and use it with other MXNet language bindings and/or Model Server.

References