See Part 2 of the series here.

So in the last entry, I detailed the code I wrote to implement my neural network, which was a feed-forward network that backpropagates errors. The focus for this entry is to try to make some predictions based on freely available stock price data (from Yahoo! Finance), and get some rough estimates of how well the network is able to forecast.

Code From Previous Entries

Since this entry builds on the previous two, I figured it would be helpful to present that code in one place below.

import sys
import os
import random
import math
import numpy as np
import pandas as pd
import yahoo_finance as yf
import time

def getHistoricalData(symbol_name, start_date, end_date, save_data=False, pathname=""):
    symbol = yf.Share(symbol_name)
    price_data = symbol.get_historical(start_date, end_date)
    price_df = pd.DataFrame(price_data)
    if save_data:
        if len(pathname) > 0:
            if not os.path.exists(pathname):
                os.makedirs(pathname)
            filename = pathname + "\\" + symbol_name + "_" + start_date + "_" + end_date + ".csv"
            print "Ticker data for",symbol_name,"saved to:", pathname
        else:
            filename = symbol_name + "_" + start_date + "_" + end_date + ".csv"
            print "Ticker data for",symbol_name,"saved to local directory"
        price_df.to_csv(filename)
    return price_df

class Node(object):
    def __init__(self,number_of_inputs):
        self.inputs = number_of_inputs
        self.bias = np.random.uniform(0.0,1.0)
        #self.weights = np.array([0.5] * number_of_inputs)
        self.weights = np.array([np.random.uniform(0.0,1.0)] * number_of_inputs)
        self.output = 0.0

    # named getOutput so the method does not clash with the self.output attribute
    def getOutput(self):
        return self.output

    def debug_info(self):
        info = "Bias: %f ; Weights:"%(self.bias)
        for w in self.weights:
            info += "%f," %(w)
        return info

    def getWeightAtIdx(self,idx):
        return self.weights[idx]

    def getBias(self):
        return self.bias

    def calculateActivity(self,input_vector):
        #linear basis function
        activity = self.bias
        activity += np.dot(input_vector,self.weights)
        return activity

    def activationFunction(self,input_value):
        # Sigmoid Activation
        return 1.0/(1.0 + math.exp(-input_value))

    def calculate(self,input_vector):
        activity_value = self.calculateActivity(input_vector)
        self.output = self.activationFunction(activity_value)

    def updateWeights(self,alpha,delta):
        adjustment = self.output * delta * alpha
        self.bias = self.bias + adjustment
        self.weights = self.weights + adjustment

class FeedForwardNet(object):
    def __init__(self,no_of_inputs,no_of_hidden_layers,nodes_in_hiddens,no_of_outputs,learning_rate):
        self.number_of_inputs = no_of_inputs
        self.number_of_hidden_layers = no_of_hidden_layers
        self.hidden_nodes = []
        self.hidden_outputs = []
        # the first hidden layer reads the network inputs
        self.hidden_nodes.append(np.array([Node(no_of_inputs) for x in range(nodes_in_hiddens[0])]))
        self.hidden_outputs.append(np.array([0.0 for x in range(nodes_in_hiddens[0])]))
        if no_of_hidden_layers > 1:
            # each later hidden layer reads the outputs of the layer before it
            for i in range(1,len(nodes_in_hiddens)):
                self.hidden_nodes.append(np.array([Node(nodes_in_hiddens[i-1]) for x in range(nodes_in_hiddens[i])]))
                self.hidden_outputs.append(np.array([0.0 for x in range(nodes_in_hiddens[i])]))
        self.hidden_node_list = nodes_in_hiddens
        self.output_layer = np.array([Node(nodes_in_hiddens[-1]) for i in range(no_of_outputs)])
        self.number_of_outputs = no_of_outputs
        self.network_output = np.array([0.0 for i in range(no_of_outputs)])
        self.errors = np.array([0.0 for i in range(no_of_outputs)])
        self.alpha = learning_rate

    def getNetOutputs(self):
        return self.network_output

    def debug_info(self):
        print "Number of Inputs: ", self.number_of_inputs
        print "Number of Hidden Nodes: ", self.hidden_node_list
        print "Number of Outputs: ", self.number_of_outputs
        print "Hidden Layer Node Weights:"
        count = 1
        for layer in self.hidden_nodes:
            print "Hidden Layer",count,":"
            count +=1
            for node in layer:
                print node.debug_info()
        print "Ouput Layer Node Weights:"
        for node in self.output_layer:
            print node.debug_info()
        print "Output from network:"
        print self.network_output
        print "Network Errors:"
        print self.errors

    def FeedForward(self,input_vector,true_outputs=None,Training=False):
        for y in range(len(self.hidden_nodes)):
            layer = self.hidden_nodes[y]
            output = self.hidden_outputs[y]
            for x in range (len(layer)):
                layer[x].calculate(input_vector)
                output[x] = layer[x].output
            # this layer's outputs become the next layer's inputs
            input_vector = output
        hidden_output = self.hidden_outputs[-1]
        for x in range(self.number_of_outputs):
            self.output_layer[x].calculate(hidden_output)
            self.network_output[x] = self.output_layer[x].output
            if Training:
                self.errors[x] = true_outputs[x] - self.output_layer[x].output
        # backpropagate once every output error has been computed
        if Training:
            self.BackPropagate()
        else:
            return self.network_output

    def BackPropagate(self):
        deltas_for_layer = []
        for i in range(self.number_of_outputs):
            output = self.network_output[i]
            delta_o = self.errors[i] * (output * (1.0-output))
            self.output_layer[i].updateWeights(self.alpha,delta_o)
            deltas_for_layer.append(delta_o)
        prev_layer = self.output_layer
        # walk the hidden layers from last to first
        for y in range(len(self.hidden_nodes)):
            layer = self.hidden_nodes[-(1+y)]
            current_layer_deltas = []
            for j in range(len(layer)):
                output = layer[j].output
                # reset the weighted sum of downstream deltas for each node
                prev_layer_factor = 0
                for x in range(len(prev_layer)):
                    prev_layer_factor += prev_layer[x].getWeightAtIdx(j) * deltas_for_layer[x]
                delta_h = (output * (1.0-output)) * prev_layer_factor
                current_layer_deltas.append(delta_h)
                layer[j].updateWeights(self.alpha,delta_h)
            prev_layer = layer
            deltas_for_layer = current_layer_deltas
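For comparison with the Node class above: its updateWeights scales one shared adjustment by the node's own output, and all of a node's weights start from a single random draw. The textbook delta rule instead moves each weight in proportion to the input it multiplies and draws each weight independently. The block below is a minimal sketch of that textbook rule for one sigmoid node; SketchNode and train_step are hypothetical names used only for illustration, and this is not the code that produced the results in this entry.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SketchNode(object):
    """Hypothetical single sigmoid node; not the Node class from this series."""
    def __init__(self, number_of_inputs, rng):
        # one independent random draw per weight
        self.weights = [rng.uniform(0.0, 1.0) for _ in range(number_of_inputs)]
        self.bias = rng.uniform(0.0, 1.0)
        self.last_input = None
        self.output = 0.0

    def calculate(self, input_vector):
        # remember the inputs so the update step can use them
        self.last_input = list(input_vector)
        activity = self.bias + sum(w * x for w, x in zip(self.weights, self.last_input))
        self.output = sigmoid(activity)
        return self.output

    def update(self, alpha, delta):
        # delta rule: each weight moves by alpha * delta * its own input
        self.weights = [w + alpha * delta * x
                        for w, x in zip(self.weights, self.last_input)]
        # the bias behaves like a weight on a constant input of 1
        self.bias += alpha * delta

def train_step(node, input_vector, target, alpha):
    out = node.calculate(input_vector)
    delta = (target - out) * out * (1.0 - out)  # error times sigmoid derivative
    node.update(alpha, delta)
    return abs(target - out)

rng = random.Random(0)
node = SketchNode(3, rng)
sample, target = [0.2, -0.5, 1.0], 0.9
first_error = train_step(node, sample, target, 0.35)
for _ in range(200):
    last_error = train_step(node, sample, target, 0.35)
print("error before: %f, after: %f" % (first_error, last_error))
```

Repeating the step on a single sample should shrink the absolute error, which is an easy sanity check for any update rule.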


In addition to that code, I also have the following two helper functions. The first simply seeds both the Python standard library random number generator and the NumPy random number generator. To be able to accurately recreate results from any algorithm that uses randomness, seeding your random number generator and saving the seed is extremely important, so this function provides a mechanism to do so.

The second function provides stats for a pandas DataFrame. Its purpose is basically to provide me with debugging information about my datasets.

def Seed_RNG(seed_val):
    print "RANDOM SEED: ",seed_val
    # seed with the argument that was passed in, not a global
    random.seed(seed_val)
    np.random.seed(seed_val)

def examine_data_frame( df):
    for name in df.columns:
        print "----------"
        print df[ name].dtype
        if df[ name].dtype is np.dtype( 'O'):
            # object columns: show the distinct values
            print df[ name].value_counts()
            print "Name: ", name
        else:
            # numeric columns: show summary statistics
            print df[ name].describe()

Initial Thoughts On Training

As a starting point, I think that the S&P 500 ETF (exchange-traded fund) SPY is a good first choice for a ‘stock’ to attempt to forecast. SPY (and the S&P 500) tend to be used as benchmarks for trading and investing strategies. Quantopian uses it as the benchmark to beat, and (at least in my opinion) it is a fairly low-volatility fund to follow. I think this low volatility may make the task of prediction easier.

The data provided from Yahoo! is the opening, closing, and adjusted closing price, daily high, daily low, and the trading volume for each day. The next question is: from these 6 variables, what should be presented to the network? I think that the daily highs and lows should be avoided. This data can sometimes be a bit unreliable, as different sources may report different values. Additionally, if I were to implement some type of system where the price is monitored in real time (like with a live feed of price data), I wouldn’t know the high/low for the day until the day is over. My concern with using high/low is that it seems fairly easy to introduce future data into the system. So I’m not going to use it. For now, I’ll just be using the open/close prices and the trading volume as inputs.

Finally, I think that there are two things I’m going to try to predict: whether a stock will close higher tomorrow and whether a stock will close higher a week from today (which is 5 trading days).

spy = "SPY"
start = "2013-01-01"
end = "2017-01-01"
spy_df = getHistoricalData("SPY",start,end)
spy_df[['Open','Close','Volume']] = spy_df[['Open','Close','Volume']].apply(pd.to_numeric)
spy_df.info()

RangeIndex: 1008 entries, 0 to 1007
Data columns (total 8 columns):
Adj_Close    1008 non-null object
Close        1008 non-null float64
Date         1008 non-null object
High         1008 non-null object
Low          1008 non-null object
Open         1008 non-null float64
Symbol       1008 non-null object
Volume       1008 non-null int64
dtypes: float64(2), int64(1), object(5)
memory usage: 63.1+ KB

Preprocessing the data

So this data looks pretty much as I would expect. The data does need some preprocessing, however, namely some normalization and standardization. This isn’t absolutely necessary for a neural network, but it is still a very good idea. Doing so, along with the random initialization of weights, allows the network to reduce errors more quickly. Additionally, the open/close price and trading volume have very different scales, so normalizing helps get us to more of an ‘apples to apples’ situation with our variables. To do this, I’ve taken the most basic approach, which is to center about the mean by subtracting the mean from the value and then dividing by 2x the standard deviation.

However, I don’t think that the mean and standard deviations should be computed for the entire data set. Since this is a time series, at time t, you wouldn’t know the mean or standard deviation for the entire dataset, as it includes data from time t+1 and greater. To get around introducing future data to the network, I think the best approach would be to use a moving average and standard deviation for normalization.

The code below does this. It takes in a number of days to scale on, the variable to scale, and the data frame, and adds a scaled version of that variable to the data frame. The function loops through the data and, for each row, computes the mean and standard deviation over the preceding num_of_days rows (the row itself is not included). The value is then scaled using those statistics and written to the new column.

def scale_on_lookback_window(num_of_days,variable,dataframe):
    scaled_var = variable + "_scaled"
    dataframe[scaled_var] = np.nan
    var_array = dataframe[variable].values
    #print num_of_days, len(dataframe[scaled_var])
    for i in range(num_of_days,len(dataframe[scaled_var])):
        # statistics come from the num_of_days rows before row i
        data_slice = var_array[(i-num_of_days):i]
        #print data_slice[0]
        data_avg = np.mean(data_slice)
        data_std = np.std(data_slice)
        # .loc writes into the frame itself, which avoids SettingWithCopyWarning
        dataframe.loc[i,scaled_var] = (dataframe[variable][i] - data_avg) / (2.0*data_std)
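To make the windowing concrete, here is a small worked example in plain Python. It is a sketch that mirrors the scaling described above on a short made-up list of values; scale_on_lookback is a hypothetical list-based helper, not the DataFrame function used in this entry, and the zero-variance guard is an added assumption.

```python
import math

def scale_on_lookback(values, num_of_days):
    """Hypothetical list-based helper mirroring the rolling scaling above."""
    scaled = [None] * len(values)
    for i in range(num_of_days, len(values)):
        window = values[i - num_of_days:i]          # strictly earlier observations
        mean = sum(window) / float(len(window))
        var = sum((v - mean) ** 2 for v in window) / float(len(window))
        std = math.sqrt(var)                        # population std, like np.std's default
        # assumption: fall back to 0.0 when the window has no variation
        scaled[i] = (values[i] - mean) / (2.0 * std) if std > 0 else 0.0
    return scaled

prices = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up values for illustration
scaled = scale_on_lookback(prices, 3)
print(scaled)
```

With a window of 3, the first three entries have no scaled value, and the value 4.0 is scaled against the mean and standard deviation of [1.0, 2.0, 3.0] only. One detail to keep in mind if this is ever rewritten with pandas rolling windows: np.std defaults to the population standard deviation, while pandas' rolling std defaults to the sample standard deviation, so the two give slightly different scales.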

scale_window = 30
scale_on_lookback_window(scale_window,"Open",spy_df)
scale_on_lookback_window(scale_window,"Close",spy_df)
scale_on_lookback_window(scale_window,"Volume",spy_df)
# the first scale_window rows have no scaled values, so drop them
spy_df.drop(spy_df.index[0:scale_window], inplace=True)
spy_df.reset_index(drop=True,inplace=True)
spy_df.info()
spy_df.head()

RangeIndex: 978 entries, 0 to 977
Data columns (total 11 columns):
Adj_Close        978 non-null object
Close            978 non-null float64
Date             978 non-null object
High             978 non-null object
Low              978 non-null object
Open             978 non-null float64
Symbol           978 non-null object
Volume           978 non-null int64
Open_scaled      978 non-null float64
Close_scaled     978 non-null float64
Volume_scaled    978 non-null float64
dtypes: float64(5), int64(1), object(5)
memory usage: 84.1+ KB

    Adj_Close       Close        Date        High         Low        Open
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled
0    SPY   65617700    -0.992043     -0.975247      -0.280858
1    SPY   91652600    -0.985221     -0.819481       0.189329
2    SPY   94580000    -0.895887     -1.030917       0.219208
3    SPY  100552700    -0.969539     -0.957311       0.306738
4    SPY  172113300    -0.707726     -0.802213       1.601762

The last bit of preprocessing I’ll be doing to the dataset is adding the true values for what we’re trying to predict. I’ll add variables representing tomorrow’s close, next week’s close, and binary variables indicating if the closing price is up from today. You don’t necessarily need to do this, but I think this will make processing the data easier.

print spy_df.head()
spy_df["tomorrow_close"] = np.nan
spy_df["week_close"] = np.nan
spy_df["tomorrow_up"] = np.nan
spy_df["week_up"] = np.nan

# one-row look-ahead: close and up/down flag
for i in range(1,len(spy_df["tomorrow_close"])):
    spy_df.loc[i-1,"tomorrow_close"] = spy_df["Close_scaled"][i]
    if spy_df["Close_scaled"][i] > spy_df["Close_scaled"][i-1]:
        spy_df.loc[i-1,"tomorrow_up"] = 1
    else:
        spy_df.loc[i-1,"tomorrow_up"] = 0

# five-row look-ahead: close and up/down flag
for i in range(5,len(spy_df["tomorrow_close"])):
    spy_df.loc[i-5,"week_close"] = spy_df["Close_scaled"][i]
    if spy_df["Close_scaled"][i] > spy_df["Close_scaled"][i-5]:
        spy_df.loc[i-5,"week_up"] = 1
    else:
        spy_df.loc[i-5,"week_up"] = 0

# the last 5 rows have no week-ahead value, so drop them
spy_df.drop(spy_df.index[-5:], inplace=True)
spy_df.info()
spy_df.head()

    Adj_Close       Close        Date        High         Low        Open  \
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled
0    SPY   65617700    -0.992043     -0.975247      -0.280858
1    SPY   91652600    -0.985221     -0.819481       0.189329
2    SPY   94580000    -0.895887     -1.030917       0.219208
3    SPY  100552700    -0.969539     -0.957311       0.306738
4    SPY  172113300    -0.707726     -0.802213       1.601762

Int64Index: 973 entries, 0 to 972
Data columns (total 15 columns):
Adj_Close         973 non-null object
Close             973 non-null float64
Date              973 non-null object
High              973 non-null object
Low               973 non-null object
Open              973 non-null float64
Symbol            973 non-null object
Volume            973 non-null int64
Open_scaled       973 non-null float64
Close_scaled      973 non-null float64
Volume_scaled     973 non-null float64
tomorrow_close    973 non-null float64
week_close        973 non-null float64
tomorrow_up       973 non-null float64
week_up           973 non-null float64
dtypes: float64(9), int64(1), object(5)
memory usage: 121.6+ KB

    Adj_Close       Close        Date        High         Low        Open
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled
0    SPY   65617700    -0.992043     -0.975247      -0.280858
1    SPY   91652600    -0.985221     -0.819481       0.189329
2    SPY   94580000    -0.895887     -1.030917       0.219208
3    SPY  100552700    -0.969539     -0.957311       0.306738
4    SPY  172113300    -0.707726     -0.802213       1.601762

   tomorrow_close  week_close  tomorrow_up  week_up
0       -0.819481   -0.820529          1.0      1.0
1       -1.030917   -1.078715          0.0      0.0
2       -0.957311   -1.105524          1.0      0.0
3       -0.802213   -1.597586          1.0      0.0
4       -0.820529   -1.325594          0.0      0.0
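One thing worth double-checking in the table above is row order: index 0 is 2016-11-16 and the dates run backwards from there, so row i+1 holds the previous trading day rather than the next one. Look-ahead labels such as tomorrow_close only mean "tomorrow" when the rows run oldest first. The block below is a minimal sketch of building forward labels under that assumption; add_forward_labels is a hypothetical helper, and the closes are the five raw Close values from the table, re-sorted oldest first.

```python
def add_forward_labels(closes, horizon):
    """Hypothetical helper; closes must be ordered oldest first."""
    future_close = [None] * len(closes)
    is_up = [None] * len(closes)
    for i in range(len(closes) - horizon):
        future_close[i] = closes[i + horizon]
        is_up[i] = 1 if closes[i + horizon] > closes[i] else 0
    return future_close, is_up

# Close values for 2016-11-10 through 2016-11-16, oldest first
closes = [216.919998, 216.419998, 216.589996, 218.279999, 217.869995]
future_close, is_up = add_forward_labels(closes, 1)
print(is_up)  # [0, 1, 1, 0, None]
```

With a DataFrame, sorting on the ISO-formatted Date column (for example spy_df.sort_values("Date") followed by reset_index(drop=True)) before scaling would put the rows in that order.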

Training and Predictions

Below is my code for running the whole dataset created above. The basic process here is:

* Set a number of days to use in training

* Take that number of days – 1 to train the network on

* Use the last day in the window to make a prediction

* If the prediction is greater than that day’s close: the prediction is 1

* Else: the prediction is 0
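The index bookkeeping behind these steps can be sketched on its own, without the network. The generator below reproduces the arithmetic used in the training loop further down (training rows at i - lookback_window + x, prediction at row i); walk_forward_windows is a hypothetical name introduced only for this illustration.

```python
def walk_forward_windows(n_rows, lookback_window):
    """Yield (training_indices, prediction_index) pairs; hypothetical helper."""
    for i in range(lookback_window, n_rows):
        # lookback_window - 1 training rows, starting lookback_window rows back
        training = [i - lookback_window + x for x in range(lookback_window - 1)]
        yield training, i

windows = list(walk_forward_windows(6, 3))
for training, predict_at in windows:
    print("train on rows %s, predict at row %d" % (training, predict_at))
```

For 6 rows and a lookback of 3, this gives training rows [0, 1] with a prediction at row 3, then [1, 2] with row 4, then [2, 3] with row 5. Note that with this arithmetic, row i-1 is not used for either training or prediction.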

This style of training the network is called online learning, where you feed it samples sequentially, as opposed to training with batch-style techniques. Additionally, we don’t keep the network from day to day, meaning each day we develop a new model for the stock price. The main reason I’m doing this is to try to prevent some amount of overfitting. My thought is that if there is some correlation between a stock’s past price and its future price, it will be most heavily weighted toward the most recent price data, so why include old data at all?

Additionally, you’ll notice that most of the parameters and meta-parameters of this training and prediction function seem to be chosen in a fairly arbitrary fashion (like only looking back 3 days, the number of nodes in the network, etc.). That’s because, quite frankly, they are arbitrary. To get good values for these, I think you would really need to create validation curves for each of these parameters. I will probably end up doing this. I don’t think, however, that I’ll include that as a blog entry, especially not in this one.
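For reference, a validation curve is just a score on held-out data recorded for each candidate value of one parameter. The block below is a minimal sketch of that bookkeeping; validation_curve and best_value are hypothetical helpers, and toy_score is a made-up stand-in for "train the network with this lookback window and return its held-out accuracy", so its numbers mean nothing.

```python
def validation_curve(score_fn, param_values):
    """Return (value, score) pairs. score_fn is any callable that trains
    and scores a model on held-out data for one parameter value."""
    return [(value, score_fn(value)) for value in param_values]

def best_value(curve):
    # pick the parameter value with the highest score
    return max(curve, key=lambda pair: pair[1])[0]

# Stand-in scorer for illustration only: a made-up curve that peaks at 10
def toy_score(lookback_window):
    return 1.0 / (1.0 + abs(lookback_window - 10))

curve = validation_curve(toy_score, [3, 5, 10, 20, 30])
for value, score in curve:
    print("lookback %2d -> score %.3f" % (value, score))
print("best lookback: %d" % best_value(curve))
```

In practice the scorer would be evaluated on a period the parameters were not tuned on, otherwise the chosen value is fitted to the same data it is judged on.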

random_seed = int(time.time())
Seed_RNG(random_seed)
num_of_iterations = 250
hidden_layers = 1
lookback_window = 3
# rows before the first full window get a placeholder of -1.0
daily_residuals = [-1.0 for i in range(lookback_window)]
weekly_residuals = [-1.0 for i in range(lookback_window)]

for i in range(lookback_window,len(spy_df["tomorrow_close"])):
    #no_of_inputs,no_of_hidden_layers,nodes_in_hiddens,no_of_outputs,learning_rate
    stock_net = FeedForwardNet(3,1,[7],2,0.35)
    for iterations in range(num_of_iterations):
        for x in range(lookback_window-1):
            idx = i - lookback_window + x
            training_vector = np.array([float(spy_df["Open_scaled"][idx]),float(spy_df["Close_scaled"][idx]), float(spy_df["Volume_scaled"][idx])])
            training_output = [float(spy_df["tomorrow_close"][idx]),float(spy_df["week_close"][idx])]
            stock_net.FeedForward(training_vector,training_output,Training=True)
    # predict from row i's own inputs (Open_scaled previously used idx here)
    pred_vector = [float(spy_df["Open_scaled"][i]),float(spy_df["Close_scaled"][i]),float(spy_df["Volume_scaled"][i])]
    pred_closes = stock_net.FeedForward(pred_vector)
    if pred_closes[0] > spy_df["Close_scaled"][i]:
        daily_residuals.append(1)
        #print "here"
    else:
        daily_residuals.append(0)
    if pred_closes[1] > spy_df["Close_scaled"][i]:
        weekly_residuals.append(1)
    else:
        weekly_residuals.append(0)
    if (i % 100 == 0):
        print i, "iterations"

print len(weekly_residuals)
spy_df["daily_residuals"] = daily_residuals
spy_df["weekly_residuals"] = weekly_residuals
spy_df.info()
spy_df.head()

RANDOM SEED: 1484778788
100 iterations
200 iterations
300 iterations
400 iterations
500 iterations
600 iterations
700 iterations
800 iterations
900 iterations
973
Int64Index: 973 entries, 0 to 972
Data columns (total 17 columns):
Adj_Close           973 non-null object
Close               973 non-null float64
Date                973 non-null object
High                973 non-null object
Low                 973 non-null object
Open                973 non-null float64
Symbol              973 non-null object
Volume              973 non-null int64
Open_scaled         973 non-null float64
Close_scaled        973 non-null float64
Volume_scaled       973 non-null float64
tomorrow_close      973 non-null float64
week_close          973 non-null float64
tomorrow_up         973 non-null float64
week_up             973 non-null float64
daily_residuals     973 non-null float64
weekly_residuals    973 non-null float64
dtypes: float64(11), int64(1), object(5)
memory usage: 136.8+ KB

    Adj_Close       Close        Date        High         Low        Open
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled
0    SPY   65617700    -0.992043     -0.975247      -0.280858
1    SPY   91652600    -0.985221     -0.819481       0.189329
2    SPY   94580000    -0.895887     -1.030917       0.219208
3    SPY  100552700    -0.969539     -0.957311       0.306738
4    SPY  172113300    -0.707726     -0.802213       1.601762

   tomorrow_close  week_close  tomorrow_up  week_up  daily_residuals  weekly_residuals
0       -0.819481   -0.820529          1.0      1.0             -1.0              -1.0
1       -1.030917   -1.078715          0.0      0.0             -1.0              -1.0
2       -0.957311   -1.105524          1.0      0.0             -1.0              -1.0
3       -0.802213   -1.597586          1.0      0.0              1.0               1.0
4       -0.820529   -1.325594          0.0      0.0              1.0               1.0

Initial Performance

Here’s how the network did for both the daily predictions and weekly predictions.

print "STATS FOR DAILY PREDICTIONS"
error_count = 0
false_pos_count = 0
false_neg_count = 0
up_preds = 0
down_preds = 0
for i in range (0,len(spy_df["tomorrow_up"])):
    if spy_df["daily_residuals"][i] != spy_df["tomorrow_up"][i]:
        error_count += 1
        # predicted up (1) but actually down (0)
        if spy_df["daily_residuals"][i] > spy_df["tomorrow_up"][i]:
            false_pos_count += 1
        # predicted down (or placeholder) but actually up
        if spy_df["daily_residuals"][i] < spy_df["tomorrow_up"][i]:
            false_neg_count += 1
    if spy_df["daily_residuals"][i] == 1.0:
        up_preds += 1
    else:
        down_preds += 1

error_rate = error_count / float(len(spy_df["daily_residuals"]))
false_pos_rate = false_pos_count / float(up_preds)
false_neg_rate = false_neg_count / float(down_preds)
print error_count, len(spy_df["daily_residuals"])
print "Error rate: ", error_rate
print "Accuracy : ", 1 - error_rate
print "False Positive Rate: ", false_pos_rate
print "False Negative Rate: ", false_neg_rate
print "Count of upward predictions: ", up_preds
print "Count of down predictions:", down_preds

STATS FOR DAILY PREDICTIONS
484 973
Error rate: 0.497430626927
Accuracy : 0.502569373073
False Positive Rate: 0.506922257721
False Negative Rate: 0.235294117647
Count of upward predictions: 939
Count of down predictions: 34

print "STATS FOR WEEKLY PREDICTIONS"
error_count = 0
false_pos_count = 0
false_neg_count = 0
up_preds = 0
down_preds = 0
for i in range (0,len(spy_df["week_up"])):
    if spy_df["weekly_residuals"][i] != spy_df["week_up"][i]:
        error_count += 1
        # predicted up (1) but actually down (0)
        if spy_df["weekly_residuals"][i] > spy_df["week_up"][i]:
            false_pos_count += 1
        # predicted down (or placeholder) but actually up
        if spy_df["weekly_residuals"][i] < spy_df["week_up"][i]:
            false_neg_count += 1
    if spy_df["weekly_residuals"][i] == 1.0:
        up_preds += 1
    else:
        down_preds += 1

error_rate = error_count / float(len(spy_df["weekly_residuals"]))
false_pos_rate = false_pos_count / float(up_preds)
false_neg_rate = false_neg_count / float(down_preds)
print error_count, len(spy_df["weekly_residuals"])
print "Error rate: ", error_rate
print "Accuracy : ", 1 - error_rate
print "False Positive Rate: ", false_pos_rate
print "False Negative Rate: ", false_neg_rate
print "Count of upward predictions: ", up_preds
print "Count of down predictions:", down_preds

STATS FOR WEEKLY PREDICTIONS
332 973
Error rate: 0.34121274409
Accuracy : 0.65878725591
False Positive Rate: 0.410666666667
False Negative Rate: 0.107623318386
Count of upward predictions: 750
Count of down predictions: 223

So attempting to predict daily ups and downs is a wash, with an error rate of roughly 50%. The weekly predictions, I think, are far more interesting, with an error rate down at 34%, meaning roughly 2 out of every 3 predictions are correct. It’s also very interesting to note that the false negative rate of the weekly predictions is very low, at just under 11%.

Some caveats to the above results: we’ve only included data from the past 4 years in this analysis, so the results may not be indicative of how well this would forecast other time periods. In other words, is this method (and the parameters selected) merely optimized for this specific set of data? Additionally, the very low false negative rate for the weekly predictions is based on only 200+ “down” predictions, so that low rate may not hold up as more “downs” are predicted.
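When reading these rates, note how the code above defines them: the printed "False Positive Rate" divides false positives by the number of upward predictions, and the printed "False Negative Rate" divides false negatives by the number of down predictions. The conventional rates divide by the number of actual downs and actual ups instead. The block below is a minimal sketch that rebuilds the confusion counts from the printed weekly figures and computes both views; confusion_from_printed is a hypothetical helper, and it assumes the printed rates round back to whole counts.

```python
def confusion_from_printed(up_preds, down_preds, printed_fp_rate, printed_fn_rate):
    """Hypothetical helper: recover counts from the printed per-prediction rates."""
    fp = int(round(printed_fp_rate * up_preds))    # predicted up, actually down
    fn = int(round(printed_fn_rate * down_preds))  # predicted down, actually up
    tp = up_preds - fp
    tn = down_preds - fn
    return tp, fp, tn, fn

# weekly SPY figures printed above
tp, fp, tn, fn = confusion_from_printed(750, 223, 0.410666666667, 0.107623318386)
print((tp, fp, tn, fn))  # (442, 308, 199, 24)

# conventional definitions: divide by the actual class sizes
conventional_fpr = fp / float(fp + tn)
conventional_fnr = fn / float(fn + tp)
print("conventional FPR: %.3f, conventional FNR: %.3f" % (conventional_fpr, conventional_fnr))
```

The recovered false positives and false negatives add up to the 332 errors printed above. The three placeholder rows with a residual of -1.0 are counted by the stats code as down predictions and as errors, so they are included in these totals.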

“Canning” This Process

To explore this process further, I think it makes sense to create functions out of the above code, so that this process can be repeated any number of times with any arbitrary ticker symbol, any start and stop dates, and arbitrary network and training parameters. The four functions below help do this. The first function normalizes the Open, Close, and Volume variables. The second adds variables to the dataset for predicting a given number of days ahead. The third is essentially the training/predicting process that was run above for SPY. The final function provides some basic metrics for how well the network was able to make classifications.

def getAndSmoothData(symbol,start_day,end_day,lookback_window):
    # use the function's own arguments rather than the notebook globals
    df = getHistoricalData(symbol,start_day,end_day)
    df[['Open','Close','Volume']] = df[['Open','Close','Volume']].apply(pd.to_numeric)
    scale_on_lookback_window(lookback_window,"Open",df)
    scale_on_lookback_window(lookback_window,"Close",df)
    scale_on_lookback_window(lookback_window,"Volume",df)
    # the first lookback_window rows have no scaled values, so drop them
    df.drop(df.index[0:lookback_window], inplace=True)
    df.reset_index(drop=True,inplace=True)
    return df

def addForecastingVariable(days_to_predict,df,var_to_predict):
    var_name = "%s_%d_day_forecast" % (var_to_predict,days_to_predict)
    #print var_name
    df[var_name] = np.nan
    df[(var_name + "_up")] = np.nan
    for i in range(days_to_predict,len(df[var_name])):
        # .loc writes into the frame itself, which avoids SettingWithCopyWarning
        df.loc[i-days_to_predict,var_name] = df[var_to_predict][i]
        if df[var_to_predict][i] > df[var_to_predict][i-days_to_predict]:
            df.loc[i-days_to_predict,(var_name + "_up")] = 1
        else:
            df.loc[i-days_to_predict,(var_name + "_up")] = 0
    # the last days_to_predict rows have no look-ahead value, so drop them
    df.drop(df.index[-days_to_predict:], inplace=True)
    df.reset_index(drop=True,inplace=True)
    #print df.info()
    #print df.head()

def run_data(df,input_vars,output_vars,hidden_nodes_list,learning_rate,training_window,training_iterations):
    residuals = []
    print len(df["Close_scaled"])
    # rows before the first full window get a placeholder of -1.0
    for x in range(len(output_vars)):
        residuals.append([-1.0 for i in range(training_window)])
    for i in range(training_window,len(df["Close_scaled"])):
        #no_of_inputs,no_of_hidden_layers,nodes_in_hiddens,no_of_outputs,learning_rate
        stock_net = FeedForwardNet(len(input_vars),len(hidden_nodes_list),hidden_nodes_list,
                                   len(output_vars),learning_rate)
        for iterations in range(training_iterations):
            for x in range(training_window-1):
                idx = i - training_window + x
                training_list = []
                for var in input_vars:
                    training_list.append(float(df[var][idx]))
                training_vector = np.array(training_list)
                out_list = []
                for var in output_vars:
                    out_list.append(float(df[var][idx]))
                training_output = np.array(out_list)
                stock_net.FeedForward(training_vector,training_output,Training=True)
        # predict from row i's own inputs (previously indexed with idx)
        pred_list = []
        for var in input_vars:
            pred_list.append(float(df[var][i]))
        pred_vector = np.array(pred_list)
        pred_closes = stock_net.FeedForward(pred_vector)
        for x in range(len(output_vars)):
            #print len(residuals[x])
            if pred_closes[x] > df["Close_scaled"][i]:
                residuals[x].append(1)
            else:
                residuals[x].append(0)
    for x in range(len(output_vars)):
        df[(output_vars[x] + "_residuals")] = residuals[x]

def Forecast_Stats(df,residuals,actuals):
    print "STATS FOR", residuals
    error_count = 0
    false_pos_count = 0
    false_neg_count = 0
    up_preds = 0
    down_preds = 0
    for i in range (0,len(df[actuals])):
        if df[residuals][i] != df[actuals][i]:
            error_count += 1
            # predicted up (1) but actually down (0)
            if df[residuals][i] > df[actuals][i]:
                false_pos_count += 1
            # predicted down (or placeholder) but actually up
            if df[residuals][i] < df[actuals][i]:
                false_neg_count += 1
        if df[residuals][i] == 1.0:
            up_preds += 1
        else:
            down_preds += 1
    error_rate = error_count / float(len(df[residuals]))
    false_pos_rate = false_pos_count / float(up_preds)
    false_neg_rate = false_neg_count / float(down_preds)
    print error_count, len(df[residuals])
    print "Error rate: ", error_rate
    print "Accuracy : ", 1 - error_rate
    print "False Positive Rate: ", false_pos_rate
    print "False Negative Rate: ", false_neg_rate
    print "Count of upward predictions: ", up_preds
    print "Count of down predictions:", down_preds

With these functions, I think it would be interesting to try out some different stocks and ETFs and see how well the network is able to make correct predictions. Again, like the SPY analysis I did above, there are a lot of arbitrary parameters used here that could (and in all honesty should) be optimized.

In addition, I’m adding a 10-day prediction. Since the weekly predictions were more accurate than the daily predictions, maybe the network is able to pick up more easily on longer-term trends.

ticker_symbols = ["FAS","BAC","JNUG","DUST","GOOG","TQQQ","ANGL","CHK","WMT"]
start = "2013-01-01"
end = "2017-01-01"
for ticker_symbol in ticker_symbols:
    print "---------------" + ticker_symbol + "-----------------------------------"
    data = getAndSmoothData(ticker_symbol,start,end,15)
    addForecastingVariable(1,data,"Close_scaled")
    addForecastingVariable(5,data,"Close_scaled")
    addForecastingVariable(10,data,"Close_scaled")
    input_variables = ["Open_scaled","Close_scaled","Volume_scaled"]
    output_vars = ["Close_scaled_1_day_forecast","Close_scaled_5_day_forecast","Close_scaled_10_day_forecast"]
    #print data.info()
    #print data.head()
    run_data(data,input_variables,output_vars,[7,5],0.25,10,100)
    for output in output_vars:
        Forecast_Stats(data,(output + "_residuals"),(output + "_up"))

---------------FAS-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
477 962
Error rate: 0.495841995842
Accuracy : 0.504158004158
False Positive Rate: 0.514318442153
False Negative Rate: 0.314606741573
Count of upward predictions: 873
Count of down predictions: 89
STATS FOR Close_scaled_5_day_forecast_residuals
424 962
Error rate: 0.440748440748
Accuracy : 0.559251559252
False Positive Rate: 0.467963386728
False Negative Rate: 0.170454545455
Count of upward predictions: 874
Count of down predictions: 88
STATS FOR Close_scaled_10_day_forecast_residuals
227 962
Error rate: 0.235966735967
Accuracy : 0.764033264033
False Positive Rate: 0.282051282051
False Negative Rate: 0.150887573964
Count of upward predictions: 624
Count of down predictions: 338
---------------BAC-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
410 962
Error rate: 0.426195426195
Accuracy : 0.573804573805
False Positive Rate: 0.447306791569
False Negative Rate: 0.259259259259
Count of upward predictions: 854
Count of down predictions: 108
STATS FOR Close_scaled_5_day_forecast_residuals
381 962
Error rate: 0.39604989605
Accuracy : 0.60395010395
False Positive Rate: 0.429078014184
False Negative Rate: 0.155172413793
Count of upward predictions: 846
Count of down predictions: 116
STATS FOR Close_scaled_10_day_forecast_residuals
251 962
Error rate: 0.260914760915
Accuracy : 0.739085239085
False Positive Rate: 0.30407523511
False Negative Rate: 0.175925925926
Count of upward predictions: 638
Count of down predictions: 324
---------------JNUG-----------------------------------
772
STATS FOR Close_scaled_1_day_forecast_residuals
318 772
Error rate: 0.411917098446
Accuracy : 0.588082901554
False Positive Rate: 0.450079239303
False Negative Rate: 0.241134751773
Count of upward predictions: 631
Count of down predictions: 141
STATS FOR Close_scaled_5_day_forecast_residuals
288 772
Error rate: 0.373056994819
Accuracy : 0.626943005181
False Positive Rate: 0.4224
False Negative Rate: 0.163265306122
Count of upward predictions: 625
Count of down predictions: 147
STATS FOR Close_scaled_10_day_forecast_residuals
206 772
Error rate: 0.266839378238
Accuracy : 0.733160621762
False Positive Rate: 0.312629399586
False Negative Rate: 0.190311418685
Count of upward predictions: 483
Count of down predictions: 289
---------------DUST-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
404 962
Error rate: 0.419958419958
Accuracy : 0.580041580042
False Positive Rate: 0.445442875481
False Negative Rate: 0.311475409836
Count of upward predictions: 779
Count of down predictions: 183
STATS FOR Close_scaled_5_day_forecast_residuals
343 962
Error rate: 0.356548856549
Accuracy : 0.643451143451
False Positive Rate: 0.402313624679
False Negative Rate: 0.163043478261
Count of upward predictions: 778
Count of down predictions: 184
STATS FOR Close_scaled_10_day_forecast_residuals
218 962
Error rate: 0.226611226611
Accuracy : 0.773388773389
False Positive Rate: 0.273770491803
False Negative Rate: 0.144886363636
Count of upward predictions: 610
Count of down predictions: 352
---------------GOOG-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
412 962
Error rate: 0.428274428274
Accuracy : 0.571725571726
False Positive Rate: 0.447058823529
False Negative Rate: 0.285714285714
Count of upward predictions: 850
Count of down predictions: 112
STATS FOR Close_scaled_5_day_forecast_residuals
421 962
Error rate: 0.43762993763
Accuracy : 0.56237006237
False Positive Rate: 0.473004694836
False Negative Rate: 0.163636363636
Count of upward predictions: 852
Count of down predictions: 110
STATS FOR Close_scaled_10_day_forecast_residuals
253 962
Error rate: 0.262993762994
Accuracy : 0.737006237006
False Positive Rate: 0.318529862175
False Negative Rate: 0.145631067961
Count of upward predictions: 653
Count of down predictions: 309
---------------TQQQ-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
464 962
Error rate: 0.482328482328
Accuracy : 0.517671517672
False Positive Rate: 0.492737430168
False Negative Rate: 0.34328358209
Count of upward predictions: 895
Count of down predictions: 67
STATS FOR Close_scaled_5_day_forecast_residuals
436 962
Error rate: 0.453222453222
Accuracy : 0.546777546778
False Positive Rate: 0.479190101237
False Negative Rate: 0.13698630137
Count of upward predictions: 889
Count of down predictions: 73
STATS FOR Close_scaled_10_day_forecast_residuals
269 962
Error rate: 0.279625779626
Accuracy : 0.720374220374
False Positive Rate: 0.335375191424
False Negative Rate: 0.161812297735
Count of upward predictions: 653
Count of down predictions: 309
---------------ANGL-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
454 962
Error rate: 0.471933471933
Accuracy : 0.528066528067
False Positive Rate: 0.491228070175
False Negative Rate: 0.317757009346
Count of upward predictions: 855
Count of down predictions: 107
STATS FOR Close_scaled_5_day_forecast_residuals
419 962
Error rate: 0.435550935551
Accuracy : 0.564449064449
False Positive Rate: 0.469964664311
False Negative Rate: 0.176991150442
Count of upward predictions: 849
Count of down predictions: 113
STATS FOR Close_scaled_10_day_forecast_residuals
270 962
Error rate: 0.280665280665
Accuracy : 0.719334719335
False Positive Rate: 0.337060702875
False Negative Rate: 0.175595238095
Count of upward predictions: 626
Count of down predictions: 336
---------------CHK-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
400 962
Error rate: 0.4158004158
Accuracy : 0.5841995842
False Positive Rate: 0.44099378882
False Negative Rate: 0.286624203822
Count of upward predictions: 805
Count of down predictions: 157
STATS FOR Close_scaled_5_day_forecast_residuals
370 962
Error rate: 0.384615384615
Accuracy : 0.615384615385
False Positive Rate: 0.425373134328
False Negative Rate: 0.177215189873
Count of upward predictions: 804
Count of down predictions: 158
STATS FOR Close_scaled_10_day_forecast_residuals
246 962
Error rate: 0.255717255717
Accuracy : 0.744282744283
False Positive Rate: 0.313114754098
False Negative Rate: 0.15625
Count of upward predictions: 610
Count of down predictions: 352
---------------WMT-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
439 962
Error rate: 0.456340956341
Accuracy : 0.543659043659
False Positive Rate: 0.481087470449
False Negative Rate: 0.275862068966
Count of upward predictions: 846
Count of down predictions: 116
STATS FOR Close_scaled_5_day_forecast_residuals
407 962
Error rate: 0.423076923077
Accuracy : 0.576923076923
False Positive Rate: 0.459976105137
False Negative Rate: 0.176
Count of upward predictions: 837
Count of down predictions: 125
STATS FOR
Close_scaled_10_day_forecast_residuals 258 962 Error rate: 0.268191268191 Accuracy : 0.731808731809 False Positive Rate: 0.320512820513 False Negative Rate: 0.171597633136 Count of upward predictions: 624 Count of down predictions: 338

So attempting to predict whether a stock is going to rise over the next two weeks (the 10-day forecast) ended up being even more accurate, with accuracy above 70% for all the stocks and ETFs in the list above, which is very interesting. Again, this may be a product of looking at this particular time period. Another thing to consider is how well this performance stacks up against other modelling and forecasting techniques. For instance, would you be able to get as good or better performance using a linear model, or by simply predicting “UP” every time? The greater-than-70% accuracy does seem really promising, but maybe there are benchmarks out there that do better.
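As a rough way to check the "predict UP every time" benchmark, the printed stats can be turned back into counts. The sketch below is not part of the original code: baseline_from_stats is a hypothetical helper, and it assumes the printed "False Positive Rate" is the share of upward predictions that were wrong and the "False Negative Rate" is the share of down predictions that were wrong. That assumption comes only from the numbers themselves, since it reproduces the printed error count.

```python
def baseline_from_stats(up_preds, down_preds, fp_share, fn_share):
    """Recover confusion counts and naive baselines from the printed stats.

    Assumption (inferred from the output, not from the original code):
    fp_share = wrong upward predictions / all upward predictions
    fn_share = wrong down predictions / all down predictions
    """
    false_pos = int(round(fp_share * up_preds))
    false_neg = int(round(fn_share * down_preds))
    true_pos = up_preds - false_pos
    true_neg = down_preds - false_neg
    total = up_preds + down_preds
    # Days that actually went up: correct "up" calls plus missed ones.
    actual_up = true_pos + false_neg
    return {
        "errors": false_pos + false_neg,
        "model_accuracy": (true_pos + true_neg) / float(total),
        "always_up_accuracy": actual_up / float(total),
        "always_down_accuracy": (total - actual_up) / float(total),
    }


# FAS, 10-day forecast, using the values printed above.
fas_10 = baseline_from_stats(624, 338, 0.282051282051, 0.150887573964)
print(fas_10)
```

For FAS at 10 days this gives 227 errors, matching the output, and an always-UP accuracy of about 0.519 against the network's 0.764. Running the same helper on the 1-day rows is a quick way to see whether the shorter forecasts clear the naive baselines at all.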

One more thing I feel I should point out is that this isn’t really a strategy for making trades. I think these predictions would most likely be best used to augment some strategy, or maybe to screen out stocks that are trending down (unless you’re looking for stocks to short, that is). Basically, I see this more as a tool to aid a strategy. I could be wrong though; I’m by no means an expert.

This post and the previous two cover the bulk of what I wanted to cover in the blog. I may write one more entry evaluating how this method compares performance-wise to other modelling techniques, or I may start a new series of posts about something else; I haven’t really decided yet. If there’s any topic within this idea of using a neural network with financial data you’d like to see, feel free to comment and let me know.

Thanks for reading! Please feel free to post any questions/comments/bug fixes!