Neural networks for the high seas

Thanks to their flexible nature, neural networks and deep learning have transformed data science. Briton Park explains how to forecast oceanic temperatures by designing, training, and evaluating a neural network model with Eclipse Deeplearning4j. This tutorial presents a proof of concept, demonstrating the flexibility of neural networks and their potential to impact a variety of real-world problems.

Neural networks have achieved breakthrough accuracy in use cases as diverse as textual sentiment analysis, fraud detection, cybersecurity, image processing, and voice recognition. One of the main reasons for this is the wide variety of flexible neural network architectures that can be applied to any given problem. In this way, deep learning (as deep neural networks are called) has transformed data science: engineers apply their knowledge about a problem to the selection and design of model architectures, rather than to feature engineering.

For example, convolutional networks use convolutions and pooling to capture spatially local patterns (nearby pixels are more likely to be correlated than those far apart) and translational invariances (a cat is still a cat if you shift the image left by four pixels). Building these sorts of assumptions directly into the architecture enables convolutional networks to achieve state-of-the-art results on a variety of computer vision tasks, often with far fewer parameters.

Recurrent neural networks (RNNs), which have experienced similar success in natural language processing, add recurrent connections between hidden state units, so that the model’s prediction at any given moment depends on the past as well as the present. This enables RNNs to capture temporal patterns that can be difficult to detect with simpler models.

In this tutorial, we focus on the problem of forecasting ocean temperatures across a grid of locations. Like many problems in the physical world, this task exhibits a complex structure including both spatial correlation (nearby locations have similar temperatures) and temporal dynamics (temperatures change over time). We tackle this challenging problem by designing a hybrid architecture that includes both convolutional and recurrent components that can be trained in end-to-end fashion directly on the ocean temperature time series data.

We share the code to design, train, and evaluate this model using Eclipse Deeplearning4j (DL4J), as well as link to the data set and Zeppelin notebook with the complete tutorial. We briefly review key concepts from deep learning and DL4J. Both the DL4J website and its companion O’Reilly book, Deep Learning: A Practitioner’s Approach, provide a more comprehensive review.

Using an open-source framework such as DL4J can significantly accelerate the development of machine learning applications. Such frameworks typically solve problems such as integration with other frameworks, coordination of parallel hardware for the distributed training of algorithms, and machine learning model deployment. Rather than building their own machine learning stack from scratch, a project as complicated as creating an operating system, developers can go straight to building the application that will produce the predictions they need.

Forecasting task

The first step in any machine learning project involves formulating the prediction problem or task. We begin by informally stating the problem we want to solve and explaining any intuitions we might have. In this project, our aim is to model and predict the average daily ocean temperature at locations around the globe. Such a model has a wide range of applications. Accurate forecasts of next weekend’s coastal water temperatures can help local officials and businesses in beach communities plan for crowds. A properly designed model can also provide insights into physical phenomena, like extreme weather events and climate change.

Slightly more formally, we define a two-dimensional (2-D) 13-by-4 grid over a regional sea, such as the Bengal Sea, yielding 52 grid cells. At each grid location, we observe a sequence of daily mean ocean temperatures. Our task is to forecast tomorrow’s daily mean temperature at each location given a recent history of temperatures at all locations. As shown in the figure below, our model will begin by reading the grid of temperatures for day 1 and predicting temperatures for day 2. It will then read day 2 and predict day 3, read day 3 and predict day 4, and so on.

In this tutorial, we apply a variant of a convolutional long short-term memory (LSTM) RNN to this problem. As we explain in detail below, the convolutional architecture is well-suited to model the geospatial structure of the temperature grid, while the RNN can capture temporal correlations in sequences of variable length.

Data

Understanding and describing our data is a critical early step in machine learning. Our data consist of daily mean ocean temperatures from 1981 to 2017, originating from eight regional seas: the Bengal, Mediterranean, Korean, Black, Bohai, Okhotsk, Arabian, and Japan seas. We focus on these areas because coastal waters show richer variation in sea temperatures throughout the year than the open ocean.

The original data are stored as CSV files, with one file for each combination of sea and year, ranging from 1981 to 2017. We further preprocess the data by extracting non-overlapping 50-day subsequences from each sea and placing each subsequence in a separate, numbered CSV file. As a result, each file contains 50 contiguous days’ worth of temperature grids from a single sea; we discard information about the exact dates and the originating sea.

The preprocessed data (available here) are organized into two directories, features and targets. “Features” is machine learning jargon for model input, while “targets” refer to the model’s expected output during training (targets are often referred to as labels in classification or as dependent variables in statistics). Each directory contains 2,089 CSV files with filenames 1.csv to 2089.csv. The feature sequences and the corresponding target sequences have the same file names, correspond to the same locations in the ocean, and both contain 51 lines: a header and 50 days of temperature grids. The fourth line (excluding the header) of a feature file contains temperatures from day 4. The fourth line of a target file contains temperatures from day 5, which we want to predict having observed temperatures through day 4. We will frequently refer to lines in the CSV file as “time steps” (common terminology when working with time series data).
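To make the one-day offset between features and targets concrete, here is a small standalone Java sketch (illustrative only, not part of the tutorial’s Scala/DL4J code; the class and method names are our own invention). Given a toy series of daily grids, it builds feature and target sequences such that target line t holds the temperatures one day after feature line t:

```java
// Sketch: how feature and target sequences line up, one day apart.
// Each "grid" is stubbed with a single value; the real files hold
// 52 comma-separated temperatures per line.
import java.util.Arrays;

public class SequenceOffsetDemo {
    // Given the full series of daily grids, features are days 1..n-1
    // and targets are days 2..n: target[t] is feature[t] one day later.
    static double[][] features(double[][] days) {
        return Arrays.copyOfRange(days, 0, days.length - 1);
    }

    static double[][] targets(double[][] days) {
        return Arrays.copyOfRange(days, 1, days.length);
    }

    public static void main(String[] args) {
        double[][] days = {{20.1}, {20.3}, {20.2}, {20.6}};
        double[][] x = features(days);
        double[][] y = targets(days);
        // The third feature line is day 3; the third target line is day 4,
        // which we want to predict having observed temperatures through day 3.
        System.out.println(x[2][0] + " -> " + y[2][0]); // 20.2 -> 20.6
    }
}
```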

Each line in the CSV file has 52 fields corresponding to the 52 cells in the temperature grid. These fields constitute a “vector” (1-D list of numerical values) with 52 elements. The grid cells appear in this vector in column-major order (cells in the first column occupy the first 13 elements, cells in the second column occupy the next 13 elements, etc.). If we append all 50 time steps from the CSV, we get a 50-by-52 2-D array or “matrix.” Finally, if we reshape each vector back into a grid, we get a 13-by-4-by-50 3-D array or “tensor.” This is similar to an RGB image, which has three dimensions (height, width, color channel), except here our dimensions represent relative latitude, relative longitude, and time.
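The column-major layout can be captured in a few lines of standalone Java (an illustrative sketch, not tutorial code; the names are made up). The mapping from grid cell to vector index, and the reshape back to a 13-by-4 grid, look like this:

```java
public class ColumnMajorDemo {
    static final int ROWS = 13, COLS = 4; // the 13-by-4 temperature grid

    // Index of grid cell (row, col) inside the 52-element CSV vector,
    // assuming column-major order as described above.
    static int toVectorIndex(int row, int col) {
        return col * ROWS + row;
    }

    // Reshape one 52-element CSV line back into a 13-by-4 grid.
    static double[][] toGrid(double[] vec) {
        double[][] grid = new double[ROWS][COLS];
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                grid[r][c] = vec[toVectorIndex(r, c)];
        return grid;
    }

    public static void main(String[] args) {
        double[] vec = new double[52];
        for (int i = 0; i < 52; i++) vec[i] = i;
        double[][] grid = toGrid(vec);
        // The first column holds elements 0-12; the second column starts at 13.
        System.out.println(grid[0][1]); // 13.0
    }
}
```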

Convolutional LSTM overview

After we have formulated our prediction task and described our data, our next step is to specify our model, or in the case of deep learning, our neural network architecture. We plan to use a variant of a convolutional LSTM, which we briefly describe here.

Convolutional networks are based on the convolution operation, which preserves spatial relationships by applying the same filtering operation at each location in a raw signal, such as sliding a box-shaped filter over a row of pixels from left to right. We treat our grid-structured temperature data like 2-D images: at each grid cell, we apply a 2-D discrete convolution, taking a dot product between a weight matrix (the kernel) and a small window around that location. The filter’s output is a scalar value for each location, indicating the filter’s “response” there. During training, the weights in the kernel are optimized to detect relevant spatial patterns over a small region, such as an elevated average temperature or a sharp change in temperature between neighboring locations in, e.g., the Mediterranean Sea. After the convolution, we apply a nonlinear activation function, in our case a rectified linear unit (ReLU).
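The sliding dot product plus ReLU can be written out directly. Below is a minimal standalone Java sketch (illustrative only; the kernel values and names are our own, and real layers learn many such kernels). The example kernel responds where temperature rises between adjacent columns:

```java
public class Conv2dDemo {
    // "Valid"-mode 2-D convolution: slide the kernel over the input,
    // take a dot product at each position, then apply ReLU.
    static double[][] convolveRelu(double[][] in, double[][] k) {
        int outH = in.length - k.length + 1;
        int outW = in[0].length - k[0].length + 1;
        double[][] out = new double[outH][outW];
        for (int i = 0; i < outH; i++)
            for (int j = 0; j < outW; j++) {
                double dot = 0;
                for (int a = 0; a < k.length; a++)
                    for (int b = 0; b < k[0].length; b++)
                        dot += k[a][b] * in[i + a][j + b];
                out[i][j] = Math.max(0, dot); // ReLU nonlinearity
            }
        return out;
    }

    public static void main(String[] args) {
        // A hand-picked kernel that responds to left-to-right temperature jumps.
        double[][] kernel = {{-1, 1}, {-1, 1}};
        double[][] grid = {{20, 22, 22}, {20, 22, 22}, {20, 22, 22}};
        double[][] response = convolveRelu(grid, kernel);
        // Strong response where the temperature jumps between columns 1 and 2,
        // zero elsewhere (and ReLU clips any negative responses to zero).
        System.out.println(response[0][0] + " " + response[0][1]); // 4.0 0.0
    }
}
```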

An LSTM is a variant of a recurrent layer (henceforth referred to as an RNN, which can refer to either the layer itself or any neural network that includes a recurrent layer). Like most neural network layers, RNNs include hidden units whose activations result from multiplying a weight matrix by a vector of inputs, followed by element-wise application of an activation function. Unlike hidden units in a standard feedforward neural network, hidden units in an RNN also receive input from hidden units from past time steps. To make this concrete with a simple example, an RNN estimating the temperature in the Black Sea on day 3 might have two inputs: the value of the hidden state on day 1 and the raw temperature on day 2. Thus, the recurrent neural network uses information from both the past and the present. The LSTM is a more complex RNN designed to address problems that arise when training RNNs, specifically the vanishing gradient problem.
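A single recurrent hidden unit can be sketched in a few lines of standalone Java (illustrative only; a real LSTM adds gates and an internal cell state on top of this basic recurrence, and the weights here are arbitrary rather than learned):

```java
public class RnnStepDemo {
    // One scalar "hidden unit": its new activation mixes the current
    // input with the previous hidden state (the recurrent connection).
    static double step(double wx, double wh, double x, double hPrev) {
        return Math.tanh(wx * x + wh * hPrev);
    }

    public static void main(String[] args) {
        double wx = 0.1, wh = 0.5; // arbitrary input and recurrent weights
        double h = 0.0;            // initial hidden state
        double[] temps = {20.0, 21.0, 19.5}; // temperatures for days 1..3
        for (double x : temps) {
            // The state after reading day t depends on all days 1..t,
            // because hPrev itself was computed from earlier inputs.
            h = step(wx, wh, x, h);
        }
        System.out.println(h); // hidden state after reading all three days
    }
}
```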

A convolutional LSTM network combines aspects of both convolutional and LSTM networks. Our network architecture is a simplified version of the model described in this NIPS 2015 paper on precipitation nowcasting, with only one variable measured per grid cell and no convolutions applied to the hidden states. The overall architecture is shown in the figure below.

At any given time step, the network accepts two inputs: the grid of current temperatures (x in the figure) and a vector of network hidden states (h in the figure) from the previous time step. We process the grid with one or more convolutional filters and flatten the output. We then pass both this flattened output and the previous hidden states to an LSTM RNN layer, which updates its gate functions and its internal state (c’ in the figure). Finally, the LSTM emits an output (h’ in the figure), which is then reshaped into a grid and used both to predict temperatures at the next step and as an input at the next time step (h in the figure).


Why a convolutional LSTM?

A convolutional structure is appropriate for this task due to the nature of the data. Heat dissipates through convection, meaning that temperatures across the ocean will tend to be “smooth” (i.e., temperatures of nearby grid cells will be similar). Thus, if neighboring cells have a high (or low) temperature, then a given cell is likely to have a high (or low) temperature as well. A convolutional network is likely to capture this local correlational structure.

On the other hand, an LSTM RNN structure is also appropriate because of the presence of short- and long-term temporal dependencies. For example, sea temperatures are unlikely to change drastically on a daily basis but rather follow a trend over days or weeks (short-to-medium-term dependencies). In addition, ocean temperatures also follow a seasonal pattern (long-term dependency): year to year, a single location is likely to follow a similar pattern of warmer and colder seasons over the course of the year. Note that our preprocessing (which generated sequences that are 50 days long) would have to be modified to allow our network to capture this type of seasonality. Specifically, we would have to use longer sequences covering multiple years.

Because of these two properties of the data, namely spatial and temporal dependencies, a convolutional LSTM structure is well-suited to this problem and data.

Code

Now that we have completed our preparatory steps (problem formulation, data description, architecture design), we are ready to begin modeling! The full code that extracts the 50-day subsequences, performs vectorization, and builds and trains the neural network is available in a Zeppelin notebook using Scala. In the following sections, we will guide you through the code.

ETL and vectorization

Before we get to the model, we first need to write some code to transform our data into a multidimensional numerical format that a neural network can read, i.e., n-dimensional arrays, also known as NDArrays or tensors. This process has much in common with traditional “extract, transform, and load” (ETL) from databases, so it is often referred to as “machine learning ETL.” It is commonly and perhaps more precisely referred to as “vectorization.” To accomplish this, we apply tools from the open-source Eclipse DataVec suite, a full-featured machine learning ETL and vectorization library associated with DL4J.

Recall that our data is contained in CSV files, each of which contains 50 days of mean temperatures at 52 locations on a 2-D geospatial grid. The CSV file stores this as 50 rows (days) with 52 columns (location). The target sequences are contained in separate CSV files with similar structure. Our vectorization code is below.

val trainFeatures = new CSVSequenceRecordReader(numSkipLines, ",");
trainFeatures.initialize(
    new NumberedFileInputSplit(featureBaseDir + "%d.csv", 1, 1936));
val trainTargets = new CSVSequenceRecordReader(numSkipLines, ",");
trainTargets.initialize(
    new NumberedFileInputSplit(targetBaseDir + "%d.csv", 1, 1936));
val train = new SequenceRecordReaderDataSetIterator(trainFeatures, trainTargets,
    batchSize, 10, regression,
    SequenceRecordReaderDataSetIterator.AlignmentMode.EQUAL_LENGTH);

To process these CSV files, we begin with a RecordReader. RecordReaders are used to parse raw data into a structured record-like format (elements indexed by a unique id). DataVec offers a variety of record readers tailored to storage formats used commonly in machine learning, e.g., CSV and SVMLight, and Hadoop, e.g., MapFiles. It is also straightforward to implement the RecordReader interface for other file formats. Because our records are in fact sequences stored in CSV format (one sequence per file), we use the CSVSequenceRecordReader.

DL4J neural networks do not accept records but rather DataSets, which collect features and targets as NDArrays and provide convenient methods for accessing and manipulating them. To convert records into DataSets, we use a RecordReaderDataSetIterator. In DL4J, a DataSetIterator is responsible for traversing a collection of DataSet objects and providing them to, e.g., a neural network, during training or evaluation. DataSetIterators provide methods for returning batches of examples (represented as DataSets) and on-the-fly preprocessing, among other things. This tutorial illustrates the most common DL4J vectorization pattern: using a RecordReader in combination with a RecordReaderDataSetIterator. However, DataSetIterators can be used without record readers. See the DL4J machine learning ETL and vectorization guide for more information.

As shown above, we create two CSVSequenceRecordReaders, one for the input features and one for the targets. The code covers the training data split, which we define to include files 1-1936, covering the years 1981-2014.

Since each pair of feature and target sequences has an equal number of time steps, we pass the AlignmentMode.EQUAL_LENGTH flag (see this post for an example of what to do if you have feature and target sequences of different length, such as in time series classification). Once the DataSetIterator is created, we are ready to configure and train our neural network.

Designing the neural network

We configure our DL4J neural network architecture using the NeuralNetConfiguration class, which provides a builder API via the public inner Builder class. Using this builder, we can specify our optimization algorithm (nearly always stochastic gradient descent), an optional custom updater like ADAM, the number and type of hidden layers, and other hyperparameters, such as the learning rate, activation functions, etc.

Critically, before adding any layers, we must first call the list() method to indicate that we are building a multilayer network. A multilayer network has a simple directed graph structure with one path through the layers; we can specify more general architectures with branching by calling graphBuilder(). Calling build() returns a MultiLayerConfiguration for a multilayer neural network, as in the code below.

val conf = new NeuralNetConfiguration.Builder()
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .seed(12345)
    .weightInit(WeightInit.XAVIER)
    .list()
    .layer(0, new ConvolutionLayer.Builder(kernelSize, kernelSize)
        .nIn(1) // 1 channel
        .nOut(7)
        .stride(2, 2)
        .learningRate(0.005)
        .activation(Activation.RELU)
        .build())
    .layer(1, new GravesLSTM.Builder()
        .activation(Activation.SOFTSIGN)
        .nIn(84)
        .nOut(200)
        .learningRate(0.0005)
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(10)
        .build())
    .layer(2, new RnnOutputLayer.Builder(LossFunction.MSE)
        .activation(Activation.IDENTITY)
        .nIn(200)
        .nOut(52)
        .learningRate(0.0005)
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(10)
        .build())
    .inputPreProcessor(0, new RnnToCnnPreProcessor(V_HEIGHT, V_WIDTH, numChannels))
    .inputPreProcessor(1, new CnnToRnnPreProcessor(6, 2, 7))
    .pretrain(false).backprop(true)
    .build();

We use the configuration builder API to add two hidden layers and one output layer. The first is a 2-D convolutional layer whose filter size is determined by the variable kernelSize. Because it is our first layer, we must define the size of our input, specifically the number of input channels (one, because each grid cell holds a single value, the temperature) and the number of output filters. Note that it is not necessary to set the width and height of the input. The stride of two means that the filter will be applied to every other grid cell. Finally, we use a rectified linear unit activation function (nonlinearity). We want to emphasize that this is a 2-D spatial convolution applied at each time step independently; there is no convolution over the sequence.
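We can check that the convolutional output matches the LSTM’s nIn(84) with a bit of arithmetic. The standalone Java sketch below (illustrative only) uses the standard valid-convolution size formula; note that kernelSize is not shown in the snippet above, so the value 2 used here is an assumption, chosen because it is consistent with both CnnToRnnPreProcessor(6, 2, 7) and nIn(84):

```java
public class ConvShapeDemo {
    // Spatial output size of a "valid" convolution with a given stride:
    // out = floor((in - kernel) / stride) + 1.
    static int outSize(int in, int kernel, int stride) {
        return (in - kernel) / stride + 1; // integer division floors
    }

    public static void main(String[] args) {
        // Assumed kernelSize = 2, stride = 2, on the 13-by-4 input grid.
        int h = outSize(13, 2, 2); // 6
        int w = outSize(4, 2, 2);  // 2
        int filters = 7;           // nOut(7) in the configuration above
        // Flattened size fed to the LSTM: 6 * 2 * 7.
        System.out.println(h * w * filters); // 84, matching the LSTM's nIn
    }
}
```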

The next layer is a Graves LSTM RNN with 200 hidden units and a softsign activation function. The final layer is an RnnOutputLayer with 52 outputs, one per temperature grid cell. DL4J output layers combine the functionality of a basic dense layer (weights and an activation function) with a loss function, and thus are equivalent to a DenseLayer followed by a LossLayer. The RnnOutputLayer is an output layer that expects a sequential (rank-3) input and also emits a sequential output. Because we are predicting a continuous value (temperature), we use the identity activation rather than a nonlinear activation function. For our loss function, we use mean squared error, a traditional loss for regression tasks.

There are several other things to note about this network configuration. For one, DL4J enables the user to define many hyperparameters, such as learning rate or weight initialization, for both the entire model and individual layers (layer settings override model settings).

In this example, we use Xavier weight initializations for the entire model but set a separate learning rate for each layer (though we use the same value for each). We also add regularization (gradient clipping to prevent gradients from growing too large during backpropagation through time) for the LSTM and output layers.
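Element-wise gradient clipping is simple to state precisely: every gradient component is forced into the interval [-threshold, threshold]. A standalone Java sketch (illustrative only; DL4J applies this internally when GradientNormalization.ClipElementWiseAbsoluteValue is configured):

```java
public class GradientClipDemo {
    // Element-wise absolute-value clipping: each component of the gradient
    // is limited to [-threshold, threshold], preventing any single huge
    // gradient value from destabilizing a training update.
    static double[] clip(double[] grad, double threshold) {
        double[] out = new double[grad.length];
        for (int i = 0; i < grad.length; i++)
            out[i] = Math.max(-threshold, Math.min(threshold, grad[i]));
        return out;
    }

    public static void main(String[] args) {
        double[] g = {0.5, -42.0, 11.3};
        // Threshold 10, matching gradientNormalizationThreshold(10) above.
        double[] clipped = clip(g, 10.0);
        System.out.println(java.util.Arrays.toString(clipped)); // [0.5, -10.0, 10.0]
    }
}
```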

Finally, we observe that when reading our data from CSV files, we get sequences of vectors (with 52 elements), but our convolutional layer expects sequences of 13-by-4 grids. Thus, we need to add a RnnToCnnPreProcessor for the first layer that reshapes each vector into a grid before applying the convolutional layer. Likewise, we use a CnnToRnnPreProcessor to flatten the output from the convolutional layer before passing it to the LSTM.

After building our neural network configuration, we initialize a neural network by passing the configuration to the MultiLayerNetwork constructor and then calling the init() method, as below.

val net = new MultiLayerNetwork(conf);
net.init();

Training the neural network

It is now time to train our new neural network. Training for this forecasting task is straightforward: we define a for loop with a fixed number of epochs (complete passes through the entire data set), calling fit on our training data iterator each time. Note that it is necessary to call reset() on the iterator at the end of each epoch.

for (epoch <- 1 to 25) {
    println("Epoch " + epoch);
    net.fit(train);
    train.reset();
}

This is the simplest possible training loop with no form of monitoring or sophisticated model selection. The official DL4J documentation and examples repository provide many examples of how to visualize and debug neural networks using the DL4J training UI, use early stopping to prevent overfitting, add listeners to monitor training, and save model checkpoints.

Evaluating the neural network

Once our model is trained, we want to evaluate it on a held-out test set. We will not go into detail about the importance of a proper training/test split here (for that, we recommend the excellent discussion in Deep Learning: A Practitioner’s Approach). Suffice it to say that it is critical to evaluate a model on data that it did not see during training. When working with time-sensitive data, we should nearly always train on the past and test on the future, mimicking the way the model would be used in practice. Splitting our data without taking time into account is likely to produce misleadingly high accuracy as a result of so-called “future leakage,” where a network makes predictions about one moment based on knowledge of subsequent moments, a circumstance the model will never encounter once deployed.

DL4J defines a variety of tools and classes for evaluating prediction performance on a number of tasks (multiclass and binary classification, regression, etc.). Here, our task is regression, so we use the RegressionEvaluation class. After initializing our regression evaluator, we can loop through the test set iterator and use the evalTimeSeries method. At the end, we can simply print out the accumulated statistics for metrics including mean squared error, mean absolute error, and correlation coefficient.
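The headline regression metrics are easy to compute by hand. The standalone Java sketch below (illustrative only; RegressionEvaluation accumulates these per output column across the whole test set) shows mean squared error and mean absolute error for a toy set of predictions:

```java
public class RegressionMetricsDemo {
    // Mean squared error: average of squared prediction errors.
    static double mse(double[] yTrue, double[] yPred) {
        double s = 0;
        for (int i = 0; i < yTrue.length; i++) {
            double d = yTrue[i] - yPred[i];
            s += d * d;
        }
        return s / yTrue.length;
    }

    // Mean absolute error: average of absolute prediction errors.
    static double mae(double[] yTrue, double[] yPred) {
        double s = 0;
        for (int i = 0; i < yTrue.length; i++)
            s += Math.abs(yTrue[i] - yPred[i]);
        return s / yTrue.length;
    }

    public static void main(String[] args) {
        double[] actual = {21.0, 22.5, 20.0};   // observed temperatures
        double[] predicted = {21.5, 22.0, 19.0}; // model forecasts
        System.out.println(mse(actual, predicted)); // (0.25 + 0.25 + 1) / 3 = 0.5
        System.out.println(mae(actual, predicted)); // (0.5 + 0.5 + 1) / 3 ≈ 0.667
    }
}
```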

The code below shows how to set up the test set record readers and iterator, create a RegressionEvaluation object, and then apply it to the trained model and test set.

val testFeatures = new CSVSequenceRecordReader(numSkipLines, ",");
testFeatures.initialize(
    new NumberedFileInputSplit(featureBaseDir + "%d.csv", 1937, 2089));
val testTargets = new CSVSequenceRecordReader(numSkipLines, ",");
testTargets.initialize(
    new NumberedFileInputSplit(targetBaseDir + "%d.csv", 1937, 2089));
val test = new SequenceRecordReaderDataSetIterator(testFeatures, testTargets,
    batchSize, 10, regression,
    SequenceRecordReaderDataSetIterator.AlignmentMode.EQUAL_LENGTH);
val eval = net.evaluateRegression(test);
test.reset();
println(eval.stats());

In the figure below, we show the test set accuracy for a handful of columns. We can see that the errors in temperature predictions of points on the grid are correlated with the values of their neighbors. For example, points on the top-left edge of the grid appear to have higher errors than the rest of the points shown, which are closer to the center of the sea. We expect to see these kinds of correlations in the model errors because of the spatial dependencies previously noted. We also observe that the convolutional LSTM outperforms simple linear autoregressive models by large margins, with a mean squared error that is typically 20-25% lower. This suggests that the complex spatial and temporal interactions captured by the neural net (but not by the linear model) provide predictive power.

Conclusion

We have shown how to use Eclipse DL4J to build a neural network for forecasting sea temperatures across a large geographic region. In doing so, we demonstrated a standard machine learning workflow that began with formulating the prediction task, moved on to vectorization and training, and ended with evaluating predictive accuracy on a held-out test set. When architecting our neural network, we added convolutional and recurrent components designed to take advantage of two important properties of the data: spatial and temporal correlations.

Despite recent high profile successes and the fact that millions of people on a daily basis use products built around machine learning (for example, speech recognition on mobile phones), its impact on our lives remains relatively narrow. One of the major frontiers in machine learning is achieving similar success in high-impact domains, such as healthcare and climate sciences. This tutorial presents a proof of concept, demonstrating the flexibility of neural networks and their potential to impact a variety of real-world problems.

The availability of open-source, well-documented machine learning frameworks makes training models easier than ever. However, training models is not an end in and of itself; our real objective is to deploy and use these models to help us make decisions in the world. The relative ease of training predictive models belies the difficulty of deploying them. Integrating predictive models with other software introduces engineering challenges that differ from those encountered during training. Deployed models often run on different hardware than they were trained on (e.g., performing inference on a smartphone after training on a cluster of GPU servers) and must satisfy requirements for latency and availability.

As we consider the future of machine learning, the fatal accident involving an Uber automated vehicle raises important questions, such as: How do we decide whether the model is ready for deployment? If our test set is outdated or otherwise fails to reflect the real world, we may discover that our model underperforms once deployed. Timely detection of underperforming models requires real-time monitoring of not just uptime but also prediction accuracy. Further, through soft deployments and controlled experiments such as A/B tests, we may be able to catch a flawed model before taking it live. Few, if any, open source tools provide this sort of functionality out of the box.

This is usually the moment in the article where you would read a sentence about the risks of machine learning looming as large as its potential, an evocation of promise tempered by an admonition to be cautious. But that’s not how we feel about it. Yes, machine learning is hard. Sure, getting it right requires a great deal of effort and diligence. But we need to get it right, because with machine learning, we are able to make smarter decisions, and as a species, we desperately need to do that. And the decisions we make about data only matter when they influence our behavior, which means we need to take machine learning out of the lab and into production environments as efficiently and safely as possible. Best of luck!