Summary

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of neural networks that has been successfully applied to image recognition and analysis. In this project, I apply this class of models to stock market prediction, combining stock prices with sentiment analysis. The network is implemented in TensorFlow, starting from the online tutorial. In this article, I describe the following steps: dataset creation, CNN training, and evaluation of the model.

Dataset

This section briefly describes the procedure used to build the dataset, the data sources, and the sentiment analysis performed.

Ticks

To build the dataset, I first chose a sector and a time period to focus on. I picked the Healthcare sector and the time range between 4th January 2016 and 30th September 2017, to be further split into a training set and an evaluation set. In particular, the list of ticks was downloaded from nasdaq.com, keeping only companies with Mega, Large or Mid capitalization. Starting from this list of ticks, stock and news data were retrieved using Google Finance and the Intrinio API respectively.

Stocks Data

As mentioned above, stock data was retrieved from the Google Finance historical API ("https://finance.google.com/finance/historical?q={tick}&startdate={startdate}&output=csv", for each tick in the list).

The time unit is the day and the value I kept is the Close price. For training purposes, missing days have been filled using linear interpolation (pandas.DataFrame.interpolate).
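As a minimal sketch of this filling step, the snippet below reindexes a short price series onto every calendar day and interpolates linearly. The prices and the variable names (closes, all_days, filled) are made up for illustration and are not taken from the project code.

```python
import pandas as pd

# Hypothetical close prices for one tick; the weekend (9th and 10th) is missing.
closes = pd.Series(
    [100.0, 103.0, 104.0],
    index=pd.to_datetime(["2016-01-08", "2016-01-11", "2016-01-12"]),
    name="Close",
)

# Reindex onto every calendar day in the range, which creates NaN gaps,
# then fill the gaps by linear interpolation between known prices.
all_days = pd.date_range(closes.index.min(), closes.index.max(), freq="D")
filled = closes.reindex(all_days).interpolate(method="linear")

print(filled)
```

With this approach the two weekend days get values evenly spaced between Friday's and Monday's close.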

News Data and Sentiment Analysis

For each tick, I downloaded the related news from "https://api.intrinio.com/news.csv?ticker={tick}". The data is in CSV format with the following columns:

TICKER,FIGI_TICKER,FIGI,TITLE,PUBLICATION_DATE,URL,SUMMARY

Here is an example row:

"AAAP,AAAP:UW,BBG007K5CV53,"3 Stocks to Watch on Thursday: Advanced Accelerator Application SA(ADR) (AAAP), Jabil Inc (JBL) and Medtronic Plc. (MDT)",2017-09-28 15:45:56 +0000,http://articlefeeds.nasdaq.com/~r/nasdaq/symbols/~3/ywZ6I5j5mIE/3-s... Market News Stock Advice amp Trading Tips Most major U S indices rose Wednesday with financial stocks leading the way popping 1 3 The 160 S amp P 500 Index gained 0 4 the 160 Dow Jones Industrial Average surged 0 3 and the 160".





News items were de-duplicated based on the title. Finally, only the TICKER, PUBLICATION_DATE and SUMMARY columns were kept.

Sentiment analysis was performed on the SUMMARY column using the Loughran and McDonald Financial Sentiment Dictionary, as implemented in the pysentiment Python library.

This library offers both a tokenizer, which also performs stemming and stop-word removal, and a method to score a tokenized text. The value taken from the get_score method as a proxy for the sentiment is the Polarity, computed as:

(#Positives - #Negatives)/(#Positives + #Negatives)

import pysentiment as ps

lm = ps.LM()
df_news['SUMMARY_SCORES'] = df_news.SUMMARY.map(lambda x: lm.get_score(lm.tokenize(str(x))))
df_news['POLARITY'] = df_news['SUMMARY_SCORES'].map(lambda x: x['Polarity'])



Days with no news are assigned a Polarity of 0.

Finally, the data was grouped by tick and date, summing the Polarity scores for days on which a tick has more than one news item.
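The grouping step can be sketched with pandas as follows. This is only an illustration under my own assumptions: the rows are invented, and the DATE column and the daily_polarity name are hypothetical helpers, not part of the original code.

```python
import pandas as pd

# Hypothetical scored news: two items for the same tick on the same day.
df_news = pd.DataFrame({
    "TICKER": ["ALXN", "ALXN", "ALXN"],
    "PUBLICATION_DATE": [
        "2017-09-26 09:00:00 +0000",
        "2017-09-26 15:30:00 +0000",
        "2017-09-27 12:00:00 +0000",
    ],
    "POLARITY": [1.0, -0.5, 0.25],
})

# Keep only the calendar date of each publication timestamp.
df_news["DATE"] = pd.to_datetime(df_news["PUBLICATION_DATE"], utc=True).dt.date

# One row per (tick, day), with the polarity scores of that day summed.
daily_polarity = df_news.groupby(["TICKER", "DATE"], as_index=False)["POLARITY"].sum()

print(daily_polarity)
```

Summing rather than averaging means that a day with many news items of the same sign gets a score outside the [-1, 1] range of a single item.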

Full Dataset

By merging stocks and news data, we get a dataset as follows, with all the days from 2016-01-04 to 2017-09-30 for 154 ticks, with the close value of the stock and the respective polarity value:

Date        Tick  Close       Polarity
2017-09-26  ALXN  139.700000   2.333332
2017-09-27  ALXN  139.450000   3.599997
2017-09-28  ALXN  138.340000   1.000000
2017-09-29  ALXN  140.290000  -0.999999
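A merge of this kind could look like the sketch below, which left-joins the daily polarity onto the stock prices and fills days without news with 0. The frame names (stocks, news, dataset) are hypothetical and the rows are illustrative: I dropped the news for one day on purpose to show the fill.

```python
import pandas as pd

# Illustrative stock rows (one per tick and day).
stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-27", "2017-09-28", "2017-09-29"]),
    "Tick": ["ALXN", "ALXN", "ALXN"],
    "Close": [139.45, 138.34, 140.29],
})

# Illustrative daily polarity; 2017-09-28 has no news in this toy example.
news = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-27", "2017-09-29"]),
    "Tick": ["ALXN", "ALXN"],
    "Polarity": [3.6, -1.0],
})

# A left join keeps every stock day; days without news get NaN, then 0.
dataset = stocks.merge(news, on=["Date", "Tick"], how="left")
dataset["Polarity"] = dataset["Polarity"].fillna(0.0)

print(dataset)
```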

CNN with TensorFlow

To get started with convolutional neural networks in TensorFlow, I used the official tutorial as a reference. It shows how to use layers to build a convolutional neural network model that recognizes the handwritten digits in the MNIST data set. To make this work for our purpose, we need to adapt both the input data and the network.

Data Model

The input data has been modelled such that a single features element is a 154x100x2 tensor:



154 ticks;

100 consecutive days;

2 channels, one for the stock price and one for the polarity value.

Labels are instead modelled as a vector of length 154, where each element is 1 if the corresponding stock rose on the next day and 0 otherwise.

In this way, there is a sliding time window of 100 days, so the first 100 days can't be used as labels. The training set contains 435 entries, while the evaluation set contains 100.
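The windowing described above can be sketched with NumPy as follows. This is not the project code: make_windows is a hypothetical helper, and I assume that "rose on the next day" means the close on the day after the window is strictly higher than the close on the window's last day. The demo uses tiny arrays (6 days, 3 ticks, window of 4) instead of the real 154x100 shapes.

```python
import numpy as np

def make_windows(close, polarity, window=100):
    """Build (features, labels) from arrays of shape (n_days, n_ticks)."""
    n_days, _ = close.shape
    # Stack price and polarity as two channels: shape (n_ticks, n_days, 2).
    data = np.stack([close.T, polarity.T], axis=-1)
    features, labels = [], []
    for t in range(window, n_days):
        # The window covers days t-window .. t-1; the label refers to day t.
        features.append(data[:, t - window:t, :])
        labels.append((close[t] > close[t - 1]).astype(np.int32))
    return np.array(features), np.array(labels)

# Toy data: prices that rise every day, no news.
close = np.cumsum(np.ones((6, 3)), axis=0)
polarity = np.zeros((6, 3))
X, y = make_windows(close, polarity, window=4)

print(X.shape, y.shape)
```

Each feature element has shape (n_ticks, window, 2), which becomes 154x100x2 with the real data.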

Convolutional Neural Network

The CNN was built starting from the example in TensorFlow's tutorial and then adapted to this use case. The first two convolutional and pooling layers both have height 1, so they perform convolutions and pooling on single stocks, while the last convolutional layer has height 154, to learn correlations between stocks. Finally, there are the dense layers, the last of which has 154 units, one for each stock.
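As a rough check of the layer sizes, the sketch below traces the shape of a single input through the three convolution and pooling stages, using the filter counts and pooling sizes from the model function reported in this section. The helpers conv_same and max_pool are hypothetical and only do shape arithmetic. They assume stride-1 convolutions with "same" padding and pooling with "valid" padding, which is the default of tf.layers.max_pooling2d.

```python
def conv_same(shape, filters):
    """Stride-1 convolution with "same" padding keeps height and width."""
    height, width, _ = shape
    return (height, width, filters)

def max_pool(shape, pool, strides):
    """Max pooling with "valid" padding (no padding at the borders)."""
    height, width, channels = shape
    return (
        (height - pool[0]) // strides[0] + 1,
        (width - pool[1]) // strides[1] + 1,
        channels,
    )

shape = (154, 100, 2)                                   # ticks x days x channels
shape = max_pool(conv_same(shape, 32), (1, 2), (1, 2))  # (154, 50, 32)
shape = max_pool(conv_same(shape, 8), (1, 5), (1, 5))   # (154, 10, 8)
shape = max_pool(conv_same(shape, 2), (1, 2), (1, 2))   # (154, 5, 2)

# Number of inputs to the first dense layer after flattening.
flat_units = shape[0] * shape[1] * shape[2]
print(shape, flat_units)
```

The result matches the 154 * 5 * 2 size used when flattening the output of the last pooling layer.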



The network was sized so that it could be trained in a couple of hours on this dataset using a laptop. Part of the code is reported here:

def cnn_model_fn(features, labels, mode):
    """Model function for CNN."""

    # Input Layer
    input_layer = tf.reshape(tf.cast(features["x"], tf.float32), [-1, 154, 100, 2])

    # Convolutional Layer #1
    conv1 = tf.layers.conv2d(
        inputs=input_layer,
        filters=32,
        kernel_size=[1, 5],
        padding="same",
        activation=tf.nn.relu)

    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[1, 2], strides=[1, 2])

    # Convolutional Layer #2
    conv2 = tf.layers.conv2d(
        inputs=pool1,
        filters=8,
        kernel_size=[1, 5],
        padding="same",
        activation=tf.nn.relu)

    # Pooling Layer #2
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[1, 5], strides=[1, 5])

    # Convolutional Layer #3
    conv3 = tf.layers.conv2d(
        inputs=pool2,
        filters=2,
        kernel_size=[154, 5],
        padding="same",
        activation=tf.nn.relu)

    # Pooling Layer #3
    pool3 = tf.layers.max_pooling2d(inputs=conv3, pool_size=[1, 2], strides=[1, 2])

    # Dense Layer
    pool3_flat = tf.reshape(pool3, [-1, 154 * 5 * 2])

    dense = tf.layers.dense(inputs=pool3_flat, units=512, activation=tf.nn.relu)

    dropout = tf.layers.dropout(
        inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

    # Logits Layer
    logits = tf.layers.dense(inputs=dropout, units=154)

    predictions = {
        # Generate predictions (for PREDICT and EVAL mode)
        "classes": tf.argmax(input=logits, axis=1),
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)
    multiclass_labels = tf.reshape(tf.cast(labels, tf.int32), [-1, 154])
    loss = tf.losses.sigmoid_cross_entropy(
        multi_class_labels=multiclass_labels, logits=logits)

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)



Evaluation

To evaluate the model, I did not use standard metrics; instead, I built a simulation closer to how the model would be used in practice.

We start with an initial capital C equal to 1 and run the simulation once for each value of N from 1 to 154. For each day of the evaluation set, we divide the capital into N equal parts: we put C/N on each of the N stocks to which the model assigns the highest probabilities, and 0 on the others. This gives a vector A that represents our daily allocation, and the daily gain or loss (delta) is the dot product of A with the percentage variation of each stock for that day. We end up with a new capital C = C + delta, which we re-invest on the next day. At the end of the period, the capital is greater or smaller than 1, depending on how good our choices were.
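The simulation can be sketched in a few lines of NumPy. This is my reconstruction of the procedure described above, not the original evaluation code: simulate is a hypothetical function, the inputs are made-up numbers for 2 days and 3 stocks, and returns are assumed to be fractional daily price changes (0.04 means +4%).

```python
import numpy as np

def simulate(probabilities, returns, n_top):
    """Re-invest the whole capital every day on the n_top most likely stocks.

    probabilities, returns: array-likes of shape (n_days, n_stocks).
    """
    probabilities = np.asarray(probabilities, dtype=float)
    returns = np.asarray(returns, dtype=float)
    capital = 1.0
    for probs, day_returns in zip(probabilities, returns):
        # Indices of the n_top stocks with the highest predicted probability.
        top = np.argsort(probs)[-n_top:]
        # Daily allocation A: capital / n_top on the chosen stocks, 0 elsewhere.
        allocation = np.zeros_like(day_returns)
        allocation[top] = capital / n_top
        # Daily gain or loss, added to the capital before the next day.
        capital += float(allocation @ day_returns)
    return capital

# Made-up model outputs and daily returns.
probabilities = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4]]
returns = [[0.10, -0.05, 0.00], [-0.02, 0.04, 0.01]]

best_pick = simulate(probabilities, returns, n_top=1)
equal_split = simulate(probabilities, returns, n_top=3)
print(best_pick, equal_split)
```

Setting n_top to the total number of stocks reproduces the equal-split scenario, since every stock then receives the same share of the capital regardless of the predictions.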

A good baseline for the model is N=154: this represents the overall performance of all the stocks, modelling the scenario in which we divide the capital equally among all of them. This produces a gain of around 4.27%.

For evaluation purposes, the days on which the market was closed were removed from the data.

The performance of the model, for different values of N, is reported in the picture below.

The red dotted line is the 0 baseline, while the orange line is the baseline with N=154.

The best performance is obtained with N=12, with a gain of around 8.41%, almost twice the market baseline.

For almost every N greater than 10 the performance is decent and better than the baseline, while values of N that are too small degrade it.

Conclusion

It has been very interesting to try TensorFlow and CNNs for the first time and to apply them to financial data.

This is a toy example, using a fairly small dataset and network, but it shows the potential of these models.

Please feel free to provide feedback and advice or simply to get in touch with me on LinkedIn