Neural networks are one of the most popular and powerful classes of machine learning algorithms. In quantitative finance neural networks are often used for time-series forecasting, constructing proprietary indicators, algorithmic trading, securities classification and credit risk modelling. They have also been used to construct stochastic process models and price derivatives. Despite their usefulness neural networks tend to have a bad reputation because their performance is "temperamental". In my opinion this can be attributed to poor network design owing to misconceptions regarding how neural networks work. This article discusses some of those misconceptions.

1. Neural networks are not models of the human brain

The human brain is one of the great mysteries of our time and scientists have not reached a consensus on exactly how it works. Two competing theories of the brain exist, namely the grandmother cell theory and the distributed representation theory. The first theory asserts that individual neurons have high information capacity and are capable of representing complex concepts such as your grandmother or even Jennifer Aniston. The second theory asserts that neurons are much simpler and that representations of complex objects are distributed across many neurons. Artificial neural networks are loosely inspired by the second theory.

One reason why I believe current generation neural networks are not capable of sentience (a different concept to intelligence) is because I believe that biological neurons are much more complex than artificial neurons.

Another big difference between the brain and neural networks is size and organization. Human brains contain many more neurons and synapses than neural networks, and they are self-organizing and adaptive. Neural networks, by comparison, are organized according to an architecture. Neural networks are not "self-organizing" in the same sense as the brain, which much more closely resembles a graph than an ordered network.

So what does that mean? Think of it this way: a neural network is inspired by the brain in the same way that the Olympic stadium in Beijing is inspired by a bird's nest. That does not mean that the Olympic stadium is a bird's nest; it means that some elements of birds' nests are present in the design of the stadium. In other words, elements of the brain are present in the design of neural networks but they are a lot less similar than you might think.

In fact neural networks are more closely related to statistical methods such as curve fitting and regression analysis than to the human brain. In the context of quantitative finance I think it is important to remember that because, whilst it may sound cool to say that something is 'inspired by the brain', this statement may result in unrealistic expectations or fear. For more info see 'No! Artificial Intelligence is not an existential threat'.

2. Neural networks aren't a "weak form" of statistics

Neural networks consist of layers of interconnected nodes. Individual nodes are called perceptrons and resemble a multiple linear regression. The difference between a multiple linear regression and a perceptron is that a perceptron feeds the signal generated by a multiple linear regression into an activation function which may or may not be non-linear. In a multi-layer perceptron (MLP) perceptrons are arranged into layers and layers are connected with one another. In the MLP there are three types of layers, namely the input layer, hidden layer(s), and the output layer. The input layer receives input patterns and the output layer could contain a list of classifications or output signals to which those input patterns may map. Hidden layers adjust the weightings on those inputs until the error of the neural network is minimized. One interpretation of this is that the hidden layers extract salient features in the input data which have predictive power with respect to the outputs.

Mapping Inputs : Outputs

A perceptron receives a vector of inputs, $\mathbf{z} = (z_1, z_2, \ldots, z_n)$, consisting of $n$ attributes. This vector of inputs is called an input pattern. These inputs are weighted according to the weight vector belonging to that perceptron, $\mathbf{v} = (v_1, v_2, \ldots, v_n)$. In the context of multiple linear regression these can be thought of as regression coefficients or betas. The net input signal, $net$, of the perceptron is usually the sum product of the input pattern and its weights, $net = \sum_{i=1}^{n} z_i v_i$. Neurons which use the sum-product for $net$ are called summation units.

The net input signal, minus a bias $\theta$, is then fed into some activation function, $f$. Activation functions are usually monotonically increasing functions which are bounded between either $(0, 1)$ or $(-1, 1)$ (this is discussed further on in this article). Activation functions can be linear or non-linear.

Some popular activation functions used in neural networks are shown below,
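As a rough illustration of the perceptron described above, here is a minimal Python sketch. The function names are my own, and the three activation functions are simply common textbook choices (the original figure is not available, so they are not necessarily the ones it showed).

```python
import math


def linear(net):
    """Identity activation: passes the net input through unchanged."""
    return net


def sigmoid(net):
    """Logistic sigmoid, bounded in (0, 1). Written to avoid overflow."""
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    e = math.exp(net)
    return e / (1.0 + e)


def hyperbolic_tangent(net):
    """Hyperbolic tangent, bounded in (-1, 1)."""
    return math.tanh(net)


def perceptron(inputs, weights, bias, activation):
    """Sum-product of inputs and weights, minus a bias, fed into an activation."""
    net = sum(z * v for z, v in zip(inputs, weights))
    return activation(net - bias)


# With a linear activation the perceptron is just a multiple linear regression.
print(perceptron([1.0, 2.0], [0.5, 0.25], 0.5, linear))
# Swapping in the sigmoid squashes the same signal into (0, 1).
print(perceptron([1.0, 2.0], [0.5, 0.25], 0.5, sigmoid))
```

Swapping the `activation` argument is the only difference between a plain regression unit and a non-linear neuron in this sketch.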

The simplest neural network is one which has just one neuron which maps inputs to an output. Given a pattern, $p$, the objective of this network would be to minimize the error of the output signal, $o_p$, relative to some known target value for that training pattern, $t_p$. For example, if the neuron was supposed to map $p$ to -1 but it mapped it to 1 then the error, as measured by sum-squared distance, of the neuron would be 4, $(-1 - 1)^2 = 4$.

Layering

As shown in the image above perceptrons are organized into layers. The first layer of perceptrons, called the input layer, receives the patterns, $p$, in the training set, $D_T$. The last layer maps to the expected outputs for those patterns. An example of this is that the patterns may be a list of quantities for different technical indicators regarding a security and the potential outputs may be the categories {BUY, HOLD, SELL}.

A hidden layer is one which receives as inputs the outputs from another layer; and for which the outputs form the inputs into yet another layer. So what do these hidden layers do? One interpretation is that they extract salient features in the input data which have predictive power with respect to the outputs. This is called feature extraction and in a way it performs a similar function to statistical techniques such as principal component analysis.

Deep neural networks have a large number of hidden layers and are able to extract much deeper features from the data. Recently, deep neural networks have performed particularly well for image recognition problems. An illustration of feature extraction in the context of image recognition is shown below,

I think that one of the problems facing the use of deep neural networks for trading (in addition to the obvious risk of overfitting) is that the inputs into the neural network are almost always heavily pre-processed meaning that there may be few features to actually extract because the inputs are already to some extent features.

Learning Rules

As mentioned previously the objective of the neural network is to minimize some measure of error, $\epsilon$. The most common measure of error is sum-squared-error although this metric is sensitive to outliers and may be less appropriate than tracking error in the context of financial markets.

Sum squared error (SSE),

$$\epsilon = \sum_{p=1}^{P_T} (t_p - o_p)^2$$

where $P_T$ is the number of training patterns and $t_p$ and $o_p$ are the target and the network output for pattern $p$.

Given that the objective of the network is to minimize $\epsilon$ we can use an optimization algorithm to adjust the weights in the neural network. The most common learning algorithm for neural networks is the gradient descent algorithm although other and potentially better optimization algorithms can be used. Gradient descent works by calculating the partial derivative of the error with respect to the weights for each layer in the neural network and then moving in the opposite direction to the gradient (because we want to minimize the error of the neural network). By minimizing the error we maximize the performance of the neural network in-sample.

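Expressed mathematically the update rule for the weights in the neural network ($v_i$) is given by,

$$v_i(t) = v_i(t-1) + \Delta v_i(t)$$

where

$$\Delta v_i(t) = \eta \left( -\frac{\partial \epsilon}{\partial v_i} \right)$$

where

$$\frac{\partial \epsilon}{\partial v_i} = -2 (t_p - o_p) \frac{\partial f}{\partial net_p} z_{i,p}$$

Here $\epsilon$ is the squared error for pattern $p$, $t_p$ and $o_p$ are the target and output for that pattern, $z_{i,p}$ is its $i$-th input, $net_p$ is its net input signal, and $\eta$ is the learning rate which controls how quickly or slowly the neural network converges. It is worth noting that the calculation of the partial derivative of $f$ with respect to the net input signal for a pattern represents a problem for any discontinuous activation functions, which is one reason why alternative optimization algorithms may be used. The choice of learning rate has a large impact on the performance of the neural network. Small values for $\eta$ may result in very slow convergence whereas high values for $\eta$ could result in a lot of variance in the training.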
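To make the update rule concrete, here is a minimal sketch of gradient descent for a single sigmoid neuron trained on squared error. The function name, the toy logical-OR data, and the default learning rate and epoch count are all illustrative assumptions rather than anything prescribed in this article.

```python
import math


def sigmoid(net):
    # Fine for the small net inputs in this toy example.
    return 1.0 / (1.0 + math.exp(-net))


def train_neuron(patterns, targets, eta=0.5, epochs=2000):
    """Fit one sigmoid neuron by per-pattern gradient descent on squared error."""
    n = len(patterns[0])
    v = [0.0] * n   # weights
    theta = 0.0     # bias, subtracted from the net input
    for _ in range(epochs):
        for z, t in zip(patterns, targets):
            net = sum(zi * vi for zi, vi in zip(z, v)) - theta
            o = sigmoid(net)
            # d(error)/d(net) for error = (t - o)^2 and o = sigmoid(net);
            # the sigmoid derivative is o * (1 - o).
            delta = -2.0 * (t - o) * o * (1.0 - o)
            # d(error)/d(v_i) = delta * z_i, so step against the gradient.
            v = [vi - eta * delta * zi for vi, zi in zip(v, z)]
            # d(net)/d(theta) = -1, so the bias gradient is -delta.
            theta += eta * delta
    return v, theta


# Toy data: logical OR (hypothetical example, chosen because it is separable).
patterns = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 1]
weights, bias = train_neuron(patterns, targets)
for z in patterns:
    print(z, sigmoid(sum(zi * vi for zi, vi in zip(z, weights)) - bias))
```

Raising `eta` makes each step larger, which speeds up early progress but makes the weights jump around more; lowering it does the opposite.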

Summary

Despite what some of the statisticians I have met in my time believe, neural networks are not just a "weak form of statistics for lazy analysts" (I have actually been told this before and it was quite funny); neural networks represent an abstraction of solid statistical techniques which date back hundreds of years. For a fantastic explanation of the statistics behind neural networks I recommend reading this chapter. That having been said I do agree that some practitioners like to treat neural networks as a "black box" which can be thrown at any problem without first taking the time to understand the nature of the problem and whether or not neural networks are an appropriate choice. An example of this is the use of neural networks for trading; markets are dynamic yet neural networks assume the distribution of input patterns remains stationary over time. This is discussed in more detail here.

3. Neural networks come in many architectures

Up until now we have just discussed the simplest neural network architecture, namely the multi-layer perceptron. There are many different neural network architectures (far too many to mention here) and the performance of any neural network is a function of its architecture and weights. Many modern day advances in the field of machine learning do not come from rethinking the way that perceptrons and optimization algorithms work but rather from being creative regarding how these components fit together. Below I discuss some very interesting and creative neural network architectures which have been developed over time.

Recurrent Neural Networks - some or all connections flow backwards meaning that feedback loops exist in the network. These networks are believed to perform better on time series data. As such, they may be particularly relevant in the context of the financial markets. For more information here is a link to a fantastic article entitled, The unreasonable performance of recurrent [deep] neural networks.

A more recent interesting recurrent neural network architecture is the Neural Turing Machine. This network combines a recurrent neural network architecture with memory. It has been shown that these neural networks are Turing complete and were able to learn sorting algorithms and other computing tasks.

Boltzmann neural network - one of the first fully connected neural networks was the Boltzmann neural network, a.k.a. the Boltzmann machine. These networks were the first networks capable of learning internal representations and solving very difficult combinatorial problems. One interpretation of the Boltzmann machine is that it is a Monte Carlo version of the Hopfield recurrent neural network. Boltzmann machines can be quite difficult to train, but when constrained they can prove more efficient than traditional neural networks. The most popular constraint on Boltzmann machines is to disallow direct connections between hidden neurons. This particular architecture is referred to as a Restricted Boltzmann Machine, and Restricted Boltzmann Machines are used in Deep Boltzmann Machines.

Deep neural networks - these are neural networks with multiple hidden layers. Deep neural networks have become extremely popular in more recent years due to their unparalleled success in image and voice recognition problems. The number of deep neural network architectures is growing quite quickly but some of the most popular architectures include deep belief networks, convolutional neural networks, deep restricted Boltzmann machines, stacked auto-encoders, and many more. One of the biggest problems with deep neural networks, especially in the context of financial markets which are non-stationary, is overfitting. For more info see DeepLearning.net.

Adaptive neural networks - are neural networks which simultaneously adapt and optimize their architectures whilst learning. This is done by either growing the architecture (adding more hidden neurons) or shrinking it (pruning unnecessary hidden neurons). I believe that adaptive neural networks are most appropriate for financial markets because markets are non-stationary. I say this because the features extracted by the neural network may strengthen or weaken over time depending on market dynamics. The implication of this is that any architecture which worked optimally in the past would need to be altered to work optimally today.

Radial basis networks - although not a different type of architecture in the sense of perceptrons and connections, radial basis networks make use of radial basis functions as their activation functions. These are real valued functions whose output depends on the distance from a particular point. The most commonly used radial basis function is the Gaussian. Because radial basis functions can take on much more complex forms, they were originally used for performing function interpolation. As such, a radial basis function neural network can have a much higher information capacity. Radial basis functions are also used in the kernel of a Support Vector Machine.

In summary, many hundreds of neural network architectures exist and the performance of one neural network can be significantly superior to another. As such, quantitative analysts interested in using neural networks should probably test multiple neural network architectures and consider combining their outputs together in an ensemble to maximize their investment performance. I recommend reading my article, All Your Models are Wrong, 7 Sources of Model Risk, before using Neural Networks for trading because many of the problems still apply.

4. Size matters, but bigger isn't always better

Having selected an architecture one must then decide how large or small the neural network should be. How many inputs are there? How many hidden neurons should be used? How many hidden layers should be used (if we are using a deep neural network)? And how many output neurons are required? The reason why these questions are important is that if the neural network is too large (too small) it could potentially overfit (underfit) the data, meaning that the network would not generalize well out of sample.

How many and which inputs should be used?

The number of inputs depends on the problem being solved, the quantity and quality of available data, and perhaps some creativity. Inputs are simply variables which we believe have some predictive power over the dependent variable being predicted. If the inputs to a problem are unclear, you can systematically determine which variables should be included by looking at the correlations and cross-correlation between potential independent variables and the dependent variables. This approach is detailed in the article, What Drives Real GDP Growth?

There are two problems with using correlations to select input variables. Firstly, if you are using a linear correlation metric you may inadvertently exclude useful variables. Secondly, two relatively uncorrelated variables could potentially be combined to produce a strongly correlated variable. If you look at the variables in isolation you may miss this opportunity. To overcome the second problem you could use principal component analysis to extract useful eigenvectors (linear combinations of the variables) as inputs. That said, a problem with this is that the eigenvectors may not generalize well and they also assume the distribution of input patterns is stationary.

Another problem when selecting variables is multicollinearity. Multicollinearity is when two or more of the independent variables being fed into the model are highly correlated. In the context of regression models this may cause regression co-efficients to change erratically in response to small changes in the model or the data. Given that neural networks and regression models are similar I suspect this is also a problem for neural networks.

Last, but not least, one statistical bias which may be introduced when selecting variables is omitted-variable bias. Omitted variable bias occurs when a model is created which leaves out one or more important causal variables. The bias is created when the model incorrectly compensates for the missing variable by over or underestimating the effect of one of the other variables i.e. the weights may become too large on these variables or SSE will be large.

How many hidden neurons should I use?

The optimal number of hidden units is problem specific. That said, as a general rule of thumb the more hidden units used the more probable the risk of overfitting becomes. Overfitting is when the neural network does not learn the underlying statistical properties of the data, but rather 'memorizes' the patterns and any noise they may contain. This results in neural networks which perform well in sample but poorly out of sample. So how can we avoid overfitting? There are two popular approaches used in industry, namely early stopping and regularization, and then there is my personal favourite approach, global search.

Early stopping involves splitting your training set into the main training set and a validation set. Then instead of training a neural network for a fixed number of iterations, you train it until the performance of the neural network on the validation set begins to deteriorate. Essentially this prevents the neural network from using all of the available parameters and limits its ability to simply memorize every pattern it sees. The image on the right shows two potential stopping points for the neural network (a and b).
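The early stopping loop can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `train_one_epoch` and `validation_error` are hypothetical callbacks you would supply, and the `patience` parameter (how many non-improving epochs to tolerate) is a common convention rather than something specified above.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    """Stop once validation error has not improved for `patience` epochs in a row."""
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()           # one pass over the main training set
        error = validation_error()  # error on the held-out validation set
        if error < best_error:
            # In practice you would also snapshot the weights here.
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_error
```

A small `patience` stops at the first sign of deterioration, while a larger one tolerates noisy validation curves at the cost of extra training time.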

The image below shows the performance and over-fitting of the neural network when stopped at a or b,

Regularization penalizes the neural network for using complex architectures. Complexity in this approach is measured by the size of the neural network weights. Regularization is done by adding a term to the sum squared error objective function which depends on the size of the weights. This is the equivalent of adding a prior which essentially makes the neural network believe that the function it is approximating is smooth,

$$\epsilon = \beta \sum_{p=1}^{P_T} (t_p - o_p)^2 + \alpha \sum_{j=1}^{W} v_j^2$$

where $t_p$ and $o_p$ are the target and output for training pattern $p$, $v_j$ is the $j$-th weight, and $W$ is the number of weights in the neural network. The parameters $\alpha$ and $\beta$ control the degree to which the neural network over or underfits the data. Good values for $\alpha$ and $\beta$ can be derived using Bayesian analysis and optimization. This, and the above, are explained in considerably more detail in this brilliant chapter.
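As a small illustration of a weight-penalized objective, the sketch below adds a squared-weight penalty to the sum squared error. The names `alpha` (scaling the weight penalty) and `beta` (scaling the error term), and their default values, are my own assumptions for the example.

```python
def regularized_error(targets, outputs, weights, alpha=0.01, beta=1.0):
    """Sum squared error plus a penalty on the squared size of the weights."""
    sse = sum((t - o) ** 2 for t, o in zip(targets, outputs))
    weight_penalty = sum(w ** 2 for w in weights)
    return beta * sse + alpha * weight_penalty


# Two networks with the same fit; the one with larger weights scores worse.
print(regularized_error([1, 0], [0.5, 0.5], [0.1, 0.2], alpha=0.1))
print(regularized_error([1, 0], [0.5, 0.5], [1.0, 2.0], alpha=0.1))
```

Increasing `alpha` relative to `beta` pushes the optimizer towards smaller weights and smoother functions; setting `alpha` to zero recovers the plain sum squared error.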

My favourite technique, which is also by far the most computationally expensive, is global search. In this approach a search algorithm is used to try different neural network architectures and arrive at a near optimal choice. This is most often done using genetic algorithms which are discussed further on in this article.

What Are the Outputs?

Neural networks can be used for either regression or classification. Under a regression model a single value is outputted which may be mapped to a set of real numbers, meaning that only one output neuron is required. Under a classification model an output neuron is required for each potential class to which the pattern may belong. If the classes are unknown, unsupervised neural network techniques such as self organizing maps should be used.

In conclusion, the best approach is to follow Ockham's razor. Ockham's razor argues that for two models of equivalent performance, the model with fewer free parameters will generalize better. On the other hand, one should never opt for an overly simplistic model at the cost of performance. Similarly, one should not assume that just because a neural network has more hidden neurons and maybe more hidden layers it will outperform a much simpler network. Unfortunately it seems to me that too much emphasis is placed on large networks and too little emphasis is placed on making good design decisions. In the case of neural networks, bigger isn't always better.

Entities must not be multiplied beyond necessity - William of Ockham

Entities must not be reduced to the point of inadequacy - Karl Menger

5. Many training algorithms exist for neural networks

The learning algorithm of a neural network tries to optimize the neural network's weights until some stopping condition has been met. This condition is typically either when the error of the network reaches an acceptable level of accuracy on the training set, when the error of the network on the validation set begins to deteriorate, or when the specified computational budget has been exhausted. The most common learning algorithm for neural networks is the backpropagation algorithm which uses stochastic gradient descent which was discussed earlier on in this article. Backpropagation consists of two steps:

The feedforward pass - the training data set is passed through the network, the output from the neural network is recorded, and the error of the network is calculated.

Backward propagation - the error signal is passed back through the network and the weights of the neural network are optimized using gradient descent.

There are some problems with this approach. Adjusting all the weights at once can result in a significant movement of the neural network in weight space, the gradient descent algorithm is quite slow, and it is susceptible to local minima. Local minima are a problem for specific types of neural networks including all product link neural networks. The first two problems can be addressed by using variants of gradient descent including momentum gradient descent, QuickProp, Nesterov's Accelerated Gradient (NAG), the Adaptive Gradient Algorithm (AdaGrad), Resilient Propagation (RProp), and Root Mean Squared Propagation (RMSProp). As can be seen from the image below significant improvements can be made on the classical gradient descent algorithm.

That having been said, these algorithms cannot overcome local minima and are also less useful when trying to optimize both the architecture and weights of the neural network concurrently. In order to achieve this, global optimization algorithms are needed. Two popular global optimization algorithms are Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA). Here is how they can be used to train neural networks:

Neural network vector representation - by encoding the neural network as a vector of weights, each representing the weight of a connection in the neural network, we can train neural networks using most meta-heuristic search algorithms. This technique does not work well with deep neural networks because the vectors become too large.

Particle Swarm Optimization - to train a neural network using a PSO we construct a population / swarm of those neural networks. Each neural network is represented as a vector of weights and is adjusted according to its position relative to the global best particle and its own personal best.

The fitness function is calculated as the sum-squared error of the reconstructed neural network after completing one feedforward pass of the training data set. The main consideration with this approach is the velocity of the weight updates. This is because if the weights are adjusted too quickly, the sum-squared error of the neural networks will stagnate and no learning will occur.
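A bare-bones version of this idea is sketched below: a generic PSO minimizes the sum-squared error of a one-neuron network that is rebuilt from each particle's weight vector. Everything specific here is an assumption made for illustration: the function names, the toy logical-OR data, the inertia and acceleration coefficients (commonly quoted defaults), and the velocity clamp `v_max`.

```python
import math
import random


def sigmoid(net):
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    e = math.exp(net)
    return e / (1.0 + e)


PATTERNS = [[0, 0], [0, 1], [1, 0], [1, 1]]
TARGETS = [0, 1, 1, 1]  # logical OR, a toy data set


def sse(weights):
    """Rebuild a one-neuron network from the vector (two weights, one bias) and score it."""
    *v, theta = weights
    total = 0.0
    for z, t in zip(PATTERNS, TARGETS):
        o = sigmoid(sum(zi * vi for zi, vi in zip(z, v)) - theta)
        total += (t - o) ** 2
    return total


def pso_minimize(fitness, dim, n_particles=20, iterations=100,
                 inertia=0.729, c1=1.494, c2=1.494, v_max=1.0, seed=0):
    rng = random.Random(seed)
    positions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_err = [fitness(p) for p in positions]
    g = min(range(n_particles), key=lambda i: personal_err[i])
    global_best, global_err = personal_best[g][:], personal_err[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel = (inertia * velocities[i][d]
                       + c1 * r1 * (personal_best[i][d] - positions[i][d])
                       + c2 * r2 * (global_best[d] - positions[i][d]))
                # Clamp the velocity so weights are not adjusted too quickly.
                velocities[i][d] = max(-v_max, min(v_max, vel))
                positions[i][d] += velocities[i][d]
            err = fitness(positions[i])
            if err < personal_err[i]:
                personal_best[i], personal_err[i] = positions[i][:], err
                if err < global_err:
                    global_best, global_err = positions[i][:], err
    return global_best, global_err


best_weights, best_error = pso_minimize(sse, dim=3)
print(best_weights, best_error)
```

The velocity clamp is the simplest way to address the concern about weights moving too quickly; constriction coefficients or a decaying inertia weight are alternatives.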

Genetic Algorithm - to train a neural network using a genetic algorithm we first construct a population of vector represented neural networks. Then we apply the three genetic operators on that population to evolve better and better neural networks. These three operators are,

Selection - using the sum-squared error of each network calculated after one feedforward pass, we rank the population of neural networks. The top x% of the population are selected to 'survive' to the next generation and be used for crossover.

Crossover - the top x% of the population's genes are allowed to cross over with one another. This process forms 'offspring'. In context, each offspring will represent a new neural network with weights from both of the 'parent' neural networks.

Mutation - this operator is required to maintain genetic diversity in the population. A small percentage of the population are selected to undergo mutation. Some of the weights in these neural networks will be adjusted randomly within a particular range.
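The three operators can be sketched as follows. This is a deliberately small illustration, not a tuned implementation: the function names, the toy logical-OR fitness function, the uniform crossover scheme, and all rates and ranges are assumptions of mine.

```python
import math
import random


def sigmoid(net):
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    e = math.exp(net)
    return e / (1.0 + e)


PATTERNS = [[0, 0], [0, 1], [1, 0], [1, 1]]
TARGETS = [0, 1, 1, 1]  # logical OR, a toy data set


def sse(weights):
    """Score a one-neuron network encoded as (two weights, one bias)."""
    *v, theta = weights
    return sum((t - sigmoid(sum(zi * vi for zi, vi in zip(z, v)) - theta)) ** 2
               for z, t in zip(PATTERNS, TARGETS))


def evolve(fitness, dim, pop_size=30, generations=50, survive=0.2,
           mutation_rate=0.1, mutation_scale=0.5, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    n_survivors = max(2, int(survive * pop_size))
    for _ in range(generations):
        # Selection: rank by error and keep the top fraction unchanged.
        population.sort(key=fitness)
        survivors = population[:n_survivors]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            # Crossover: each weight is inherited from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: a small share of children have some weights perturbed.
            if rng.random() < mutation_rate:
                child = [w + rng.uniform(-mutation_scale, mutation_scale)
                         if rng.random() < 0.5 else w for w in child]
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)


best = evolve(sse, dim=3)
print(best, sse(best))
```

Because the survivors are carried over unchanged, the best network never gets worse from one generation to the next; dropping that elitism trades stability for diversity.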

In addition to these population-based metaheuristic search algorithms, other algorithms have been used to train neural networks including backpropagation with added momentum, differential evolution, Levenberg-Marquardt, simulated annealing, and many more. Personally I would recommend using a combination of local and global optimization algorithms to overcome the shortcomings of both.

6. Neural networks do not always require a lot of data

Neural networks can use one of three learning strategies, namely a supervised learning strategy, an unsupervised learning strategy, or a reinforcement learning strategy. Supervised learning requires at least two data sets: a training set which consists of inputs with the expected output, and a testing set which consists of inputs whose expected outputs are withheld from the network. Both of these data sets must consist of labelled data i.e. data patterns for which the target is known upfront. Unsupervised learning strategies are typically used to discover hidden structures (such as hidden Markov chains) in unlabelled data. They behave in a similar way to clustering algorithms. Reinforcement learning is based on the simple premise of rewarding neural networks for good behaviours and punishing them for bad behaviours. Because unsupervised and reinforcement learning strategies do not require that data be labelled they can be applied to under-formulated problems where the correct output is not known.

Unsupervised Learning

One of the most popular unsupervised neural network architectures is the Self Organizing Map (also known as the Kohonen Map). Self Organizing Maps are essentially a multi-dimensional scaling technique which constructs an approximation of the probability density function of some underlying data set, $Z$, whilst preserving the topological structure of that data set. This is done by mapping input vectors, $\mathbf{z}_i$, in the data set, $Z$, to weight vectors, $\mathbf{v}_j$, (neurons) in the feature map, $V$. Preserving the topological structure simply means that if two input vectors are close together in $Z$, then the neurons to which those input vectors map in $V$ will also be close together.
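A minimal one-dimensional self organizing map might look like the sketch below. It is only meant to show the mechanics of finding the closest neuron and pulling it and its neighbours towards each input; the function names, the Gaussian neighbourhood, the linear decay schedule, and the toy data are all my own assumptions.

```python
import math
import random


def best_matching_unit(weights, z):
    """Index of the neuron whose weight vector is closest to input z."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], z))


def train_som(data, n_neurons=5, epochs=50, eta=0.5, radius=2.0, seed=0):
    """Train a 1-D map: neurons are arranged on a line, indexed 0..n_neurons-1."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        # Shrink the learning rate and neighbourhood width over time.
        decay = 1.0 - epoch / epochs
        lr = eta * decay
        sigma = max(radius * decay, 1e-3)
        for z in data:
            bmu = best_matching_unit(weights, z)
            for j, w in enumerate(weights):
                # Neurons near the winner on the map move more than distant ones.
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                weights[j] = [wi + lr * h * (zi - wi) for wi, zi in zip(w, z)]
    return weights


# Two loose clusters of 2-D points (hypothetical data).
data = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
som = train_som(data, n_neurons=4, epochs=20)
for z in data:
    print(z, "->", best_matching_unit(som, z))
```

The neighbourhood function is what preserves topology: because neighbouring neurons are updated together, nearby inputs tend to end up mapped to nearby neurons.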

For more information on self organizing maps and how they can be used to produce lower-dimensionality data sets click here. Another interesting application of SOMs is in colouring time series charts for stock trading. This is done to show what the market conditions are at that point in time. This website provides a detailed tutorial and code snippets for implementing the idea for improved Forex trading strategies.

Reinforcement Learning

Reinforcement learning strategies consist of three components. A policy which specifies how the neural network will make decisions e.g. using technical and fundamental indicators. A reward function which distinguishes good from bad e.g. making vs. losing money. And a value function which specifies the long term goal. In the context of financial markets (and game playing) reinforcement learning strategies are particularly useful because the neural network learns to optimize a particular quantity such as an appropriate measure of risk adjusted return.

7. Neural networks cannot be trained on any data

One of the biggest reasons why neural networks may not work is because people do not properly pre-process the data being fed into the neural network. Data normalization, removal of redundant information, and outlier removal should all be performed to improve the probability of good neural network performance.

Data normalization - neural networks consist of various layers of perceptrons linked together by weighted connections. Each perceptron contains an activation function, each of which has an 'active range' (except for radial basis functions). Inputs into the neural network need to be scaled within this range so that the neural network is able to differentiate between different input patterns.

For example, consider a neural network trading system which receives indicators about a set of securities as inputs and outputs whether each security should be bought or sold. One of the inputs is the price of the security and we are using the sigmoid activation function. However, most of the securities cost between $5 and $15 per share, and for inputs of that size the output of the sigmoid function approaches 1.0. So the output of the sigmoid function will be close to 1.0 for all securities, all of the perceptrons will 'fire' and the neural network will not learn.
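The saturation effect is easy to reproduce. The sketch below feeds a few hypothetical share prices straight into a sigmoid (implicitly assuming a weight of 1 and no bias), then repeats the exercise after standardizing the prices to zero mean and unit standard deviation. Standardization is just one of several reasonable scaling choices.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


prices = [5.0, 8.0, 10.0, 12.0, 15.0]  # hypothetical share prices

raw = [sigmoid(p) for p in prices]                  # all squashed close to 1.0
scaled = [sigmoid(z) for z in standardize(prices)]  # spread across the active range

print([round(o, 4) for o in raw])
print([round(o, 4) for o in scaled])
```

On the raw prices every output sits above 0.99, so the neuron cannot tell a $5 share from a $15 one; after scaling the outputs are clearly separated.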

Neural networks trained on unprocessed data produce models where 'the lights are on but nobody's home'

Outlier removal - an outlier is a value that is much smaller or larger than most of the other values in some set of data. Outliers can cause problems with statistical techniques like regression analysis and curve fitting because when the model tries to 'accommodate' the outlier, performance of the model across all other data deteriorates,

The illustration shows that trying to accommodate an outlier in the linear regression model results in a poor fit of the data set. The effect of outliers on non-linear regression models, including neural networks, is similar. Therefore it is good practice to remove outliers from the training data set. That said, identifying outliers is a challenge in and of itself; this tutorial and paper discuss existing techniques for outlier detection and removal.
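As one simple illustration of outlier removal, the sketch below drops points with a large modified z-score, which is based on the median and the median absolute deviation rather than the mean and standard deviation. This is one common heuristic among many, not the technique from the linked material, and the 3.5 cut-off is a conventional rule of thumb rather than a universal constant.

```python
import statistics


def remove_outliers(values, threshold=3.5):
    """Drop points whose modified z-score exceeds `threshold`."""
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against
    return [v for v in values if abs(0.6745 * (v - median) / mad) <= threshold]


readings = [10, 11, 9, 10, 12, 11, 10, 95]  # hypothetical data with one outlier
print(remove_outliers(readings))
```

Using the median makes the filter itself robust: a single extreme value cannot inflate the spread estimate enough to hide itself.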

Remove redundancy - when two or more of the independent variables being fed into the neural network are highly correlated (multicollinearity) this can negatively affect the neural network's learning ability. Highly correlated inputs also mean that the amount of unique information presented by each variable is small, so the less significant input can be removed. Another benefit to removing redundant variables is faster training times. Adaptive neural networks can be used to prune redundant connections and perceptrons.
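A simple way to screen for redundant inputs is to compute pairwise correlations and drop one variable from each highly correlated pair. The sketch below does this with the Pearson correlation. The function names, the 0.95 threshold, and the rule of keeping whichever column comes first (rather than judging which input is less significant) are simplifying assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


inputs = {  # hypothetical candidate inputs
    "price": [1, 2, 3, 4, 5],
    "price_x2": [2, 4, 6, 8, 10],  # carries no new information
    "volume": [3, 1, 4, 1, 5],
}
print(list(drop_redundant(inputs)))
```

Note that this only catches linear redundancy between pairs of variables; it says nothing about non-linear relationships or redundancy among three or more inputs.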

8. Neural networks may need to be retrained

Even if you were able to train a neural network to trade successfully in and out of sample, it may still stop working over time. This is not a poor reflection on neural networks but rather an accurate reflection of the financial markets. Financial markets are complex adaptive systems, meaning that they are constantly changing, so what worked yesterday may not work tomorrow. This characteristic is called non-stationarity, and it gives rise to dynamic optimization problems, which neural networks are not particularly good at handling.

Dynamic environments, such as financial markets, are extremely difficult for neural networks to model. Two approaches are either to keep retraining the neural network over time, or to use a dynamic neural network. Dynamic neural networks 'track' changes to the environment over time and adjust their architecture and weights accordingly. They are adaptive over time. For dynamic problems, multi-solution meta-heuristic optimization algorithms can be used to track changes to local optima over time. One such algorithm is the multi-swarm optimization algorithm, a derivative of particle swarm optimization. Additionally, genetic algorithms with enhanced diversity or memory have also been shown to be robust in dynamic environments.

The illustration below demonstrates how a genetic algorithm evolves over time to find new optima in a dynamic environment. This illustration also happens to mimic trade crowding which is when market participants crowd a profitable trading strategy, thereby exhausting trading opportunities causing the trade to become less profitable.


9. Neural networks are not black boxes

By itself a neural network is a black box. This presents problems for people wanting to use them. For example, fund managers wouldn't know how a neural network makes trading decisions, so it is impossible to assess the risks of the trading strategies learned by the neural network. Similarly, banks using neural networks for credit risk modelling would not be able to justify why a customer has a particular credit rating, which is a regulatory requirement. That said, state-of-the-art rule-extraction algorithms have been developed to vitrify some neural network architectures. These algorithms extract knowledge from the neural networks as either mathematical expressions, symbolic logic, fuzzy logic, or decision trees.

Mathematical rules - algorithms have been developed which can extract multiple linear regression lines from neural networks. The problem with these techniques is that the rules are often still difficult to understand, therefore these do not solve the 'black-box' problem.

Propositional logic - propositional logic is a branch of mathematical logic which deals with operations done on discrete valued variables. These variables, such as A or B, are often either TRUE or FALSE, but they could occupy values within a discrete range e.g. {BUY,HOLD,SELL}.

Logical operations such as OR, AND, and XOR can then be applied to those variables. The results are called predicates, which can also be quantified over sets using the exists or for-all quantifiers. This is the difference between predicate and propositional logic. If we had a simple neural network which took Price (P), Simple Moving Average (SMA), and Exponential Moving Average (EMA) as inputs, and we extracted a trend following strategy from the neural network in propositional logic, we might get a small set of logical rules over P, SMA, and EMA.
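A hypothetical rule set of that kind, written as a small Python function, might look like the sketch below. The specific comparisons and the three-way BUY/HOLD/SELL outcome are assumptions for illustration only; they are not rules extracted from a real network.

```python
def trend_rule(price, sma, ema):
    """Hypothetical propositional rules over the inputs P, SMA and EMA."""
    above_sma = price > sma  # proposition A
    above_ema = price > ema  # proposition B
    below_sma = price < sma  # proposition C
    below_ema = price < ema  # proposition D

    if above_sma and above_ema:  # A AND B -> BUY
        return "BUY"
    if below_sma and below_ema:  # C AND D -> SELL
        return "SELL"
    return "HOLD"                # otherwise -> HOLD


print(trend_rule(price=105.0, sma=100.0, ema=102.0))
print(trend_rule(price=95.0, sma=100.0, ema=102.0))
print(trend_rule(price=101.0, sma=100.0, ema=102.0))
```

Each rule returns exactly one discrete value, which is the limitation the fuzzy logic approach addresses.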

Fuzzy logic - fuzzy logic is where probability and propositional logic meet. The problem with propositional logic is that it deals in absolutes e.g. BUY or SELL, TRUE or FALSE, 0 or 1. Therefore, for traders, there is no way to determine the confidence of these results. Fuzzy logic overcomes this limitation by introducing a membership function which specifies how much a variable belongs to a particular domain. For example, a company (GOOG) might belong 0.7 to the domain {BUY} and 0.3 to the domain {SELL}. Combinations of neural networks and fuzzy logic are called Neuro-Fuzzy systems. This research survey discusses various fuzzy rule extraction techniques.
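A membership function can be as simple as a linear ramp. The sketch below maps a hypothetical trading signal in the range [-1, 1] to degrees of membership in {BUY} and {SELL}. The signal, its range, and the choice to make the two memberships sum to one are assumptions for this example; fuzzy memberships do not have to sum to one in general.

```python
def ramp(x, low, high):
    """Linear membership: 0 at or below `low`, 1 at or above `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def memberships(signal):
    """Degrees of membership of a hypothetical signal in [-1, 1] to BUY and SELL."""
    buy = ramp(signal, -1.0, 1.0)
    return {"BUY": buy, "SELL": 1.0 - buy}


# A signal of 0.4 belongs roughly 0.7 to BUY and 0.3 to SELL.
goog = memberships(0.4)
print(goog)
```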

Decision trees - decision trees show how decisions are made when given certain information. This article describes how to evolve security analysis decision trees using genetic programming. Decision tree induction is the term given to the process of extracting decision trees from neural networks.


10. Neural networks are not hard to implement

This list is updated, from time to time, when I have time. Last updated: November 2015.

Speaking from experience, neural networks are quite challenging to code from scratch. Luckily there are now hundreds of open source and proprietary packages which make working with neural networks a lot easier. Below is a list of packages which quants may find useful for quantitative finance. The list is NOT exhaustive, and is ordered alphabetically. If you have any additional comments, or frameworks to add, please share via the comment section.

Caffe

Webpage - http://caffe.berkeleyvision.org/

GitHub Repository - https://github.com/BVLC/caffe

"Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley." - Caffe webpage (November 2015)

Encog

Webpage - http://www.heatonresearch.com/encog/

GitHub Repositories - https://github.com/encog

"Encog is an advanced machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data. Machine learning algorithms such as Support Vector Machines, Artificial Neural Networks, Genetic Programming, Bayesian Networks, Hidden Markov Models, Genetic Programming and Genetic Algorithms are supported. Most Encog training algorithms are multi-threaded and scale well to multicore hardware. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train machine learning algorithms." - Encog webpage

H2O

Webpage - http://h2o.ai/

GitHub Repositories - https://github.com/h2oai

H2O is not strictly a package for machine learning. Instead, they expose an API for doing fast and scalable machine learning for smarter applications which use big data. Their API supports deep learning models, generalized boosting models, generalized linear models, and more. They also host a cool conference; check out the videos :).

Google TensorFlow

Webpage - http://www.tensorflow.org/

GitHub repository - https://github.com/tensorflow/tensorflow

"TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code." - GitHub repository (November 2015)

Microsoft Distributed Machine Learning Toolkit

Webpage - http://www.dmtk.io/

GitHub repository - https://github.com/Microsoft/DMTK

"DMTK includes the following projects: DMTK framework(Multiverso): The parameter server framework for distributed machine learning. LightLDA: Scalable, fast and lightweight system for large-scale topic modeling. Distributed word embedding: Distributed algorithm for word embedding. Distributed skipgram mixture: Distributed algorithm for multi-sense word embedding." - GitHub repository (November 2015)

Microsoft Azure Machine Learning

Webpage - https://azure.microsoft.com/en-us/services/machine-learning

GitHub Repositories - https://github.com/Azure?utf8=%E2%9C%93&query=MachineLearning

The machine learning / predictive analytics platform in Microsoft Azure is a fully managed cloud service that enables you to easily build, deploy, and share predictive analytics solutions. This software basically allows you to drag and drop pre-built components (including machine learning models) and custom-built components which manipulate data sets into a process. This flow-chart is then compiled into a program and can be deployed as a web-service. It is similar to the older SAS Enterprise Miner solution except that it is more modern, more functional, supports deep learning models, and exposes clients for Python and R.

MXNet

Webpage - http://mxnet.readthedocs.org/en/latest/

GitHub Repositories - https://github.com/dmlc/mxnet

"MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix the flavours of symbolic programming and imperative programming together to maximize the efficiency and your productivity. In its core, a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer is build on top, which makes symbolic execution fast and memory efficient. The library is portable and lightweight, and is ready scales to multiple GPUs, and multiple machines." - MXNet GitHub Repository (November 2015)

Neon

Webpage - http://neon.nervanasys.com/docs/latest/index.html

GitHub Repository - https://github.com/nervanasystems/neon

"neon is Nervana's Python based Deep Learning framework and achieves the fastest performance on many common deep neural networks such as AlexNet, VGG and GoogLeNet. We have designed it with the following functionality in mind: 1) Support for commonly used models and examples: convnets, MLPs, RNNs, LSTMs, autoencoders, 2) Tight integration with nervanagpu kernels for fp16 and fp32 (benchmarks) on Maxwell GPUs, 3) Basic automatic differentiation support, 4) Framework for visualization, and 5) Swappable hardware backends ..." - neon GitHub repository (November 2015)

Theano

Webpage - http://deeplearning.net/software/theano/

GitHub repository - https://github.com/Theano/Theano

"Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation." - Theano GitHub repository (November 2015). Theano, like TensorFlow and Torch, is more broadly applicable than just Neural Networks. It is a framework for implementing existing or creating new machine learning models using off-the-shelf data-structures and algorithms.

Torch

Webpage - http://torch.ch/

GitHub Repository - https://github.com/torch/torch7

"Torch is a scientific computing framework with wide support for machine learning algorithms ... A summary of core features include an N-dimensional array, routines for indexing, slicing, transposing, an interface to C, via LuaJIT, linear algebra routines, neural network, energy-based models, numeric optimization routines, Fast and efficient GPU support, Embeddable, with ports to iOS, Android and FPGA" - Torch Webpage (November 2015). Like TensorFlow and Theano, Torch is more broadly applicable than just Neural Networks. It is a framework for implementing existing or creating new machine learning models using off-the-shelf data-structures and algorithms.

SciKit Learn

Webpage - http://scikit-learn.org/stable/

GitHub Repository - https://github.com/scikit-learn/scikit-learn

SciKit Learn is a very popular package for doing machine learning in Python. It is open source, is built on NumPy, SciPy, and matplotlib, and exposes implementations of various machine learning models for classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

As I mentioned, there are now hundreds of machine learning packages and frameworks out there. Before committing to any one solution I would recommend doing a best-fit analysis to see which open source or proprietary machine learning package or software best matches your use-cases. Generally speaking a good rule to follow in software engineering and model development for quantitative finance is to not reinvent the wheel ... that said, for any sufficiently advanced model you should expect to have to write some of your own code.


Conclusion

Neural networks are a class of powerful machine learning algorithms. They are based on solid statistical foundations and have been applied successfully in financial models as well as in trading strategies for many years. Despite this, they have a bad reputation due to the many unsuccessful attempts to use them in practice. In most cases, unsuccessful neural network implementations can be traced back to inappropriate neural network design decisions and general misconceptions about how they work. This article aims to articulate some of these misconceptions in the hopes that they might help individuals implementing neural networks meet with success.

For readers interested in getting more information, I have found the following books to be quite instructional when it comes to neural networks and their role in financial modelling and algorithmic trading.