In this tutorial I'll walk you through a simple methodology to correlate various stocks against each other. We'll grab the prices of the selected stocks using python, drop them into a clean dataframe, run a correlation, and visualize our results.

(1) Import libraries, select our list of stocks to correlate

import numpy as np import pandas as pd #used to grab the stock prices, with yahoo import pandas_datareader as web from datetime import datetime #to visualize the results import matplotlib.pyplot as plt import seaborn #select start date for correlation window as well as list of tickers start = datetime(2017, 1, 1) symbols_list = ['AAPL', 'F', 'TWTR', 'FB', 'AAL', 'AMZN', 'GOOGL', 'GE']

(2) Pull stock prices, push into clean dataframe

#array to store prices symbols=[] #pull price using iex for each symbol in list defined above for ticker in symbols_list: r = web.DataReader(ticker, 'yahoo', start) # add a symbol column r['Symbol'] = ticker symbols.append(r) # concatenate into df df = pd.concat(symbols) df = df.reset_index() df = df[['Date', 'Close', 'Symbol']] df.head()

We're now left with a table in this format:

date close symbol 2017-01-03 113.4101 AAPL 2017-01-04 113.2832 AAPL 2017-01-05 113.8593 AAPL

3) However, we want our symbols represented as columns so we'll have to pivot the dataframe:

df_pivot = df.pivot('date','symbol','close').reset_index() df_pivot.head()

date AAL AAPL AMZN F FB GE GOOGL TWTR 0 2017-01-03 45.7092 113.4101 753.67 11.4841 116.86 30.1370 808.01 16.44 1 2017-01-04 46.1041 113.2832 757.18 12.0132 118.69 30.1465 807.77 16.86 2 2017-01-05 45.3044 113.8593 780.45 11.6483 120.67 29.9754 813.02 17.09 3 2017-01-06 45.6203 115.1286 795.99 11.6392 123.41 30.0610 825.21 17.17 4 2017-01-09 46.4792 116.1832 796.92 11.5206 124.90 29.9183 827.18 17.50

(4) Next, we can run the correlation. Using the Pandas 'corr' function to compute the Pearson correlation coeffecient between each pair of equities

corr_df = df_pivot.corr(method='pearson') #reset symbol as index (rather than 0-X) corr_df.head().reset_index() del corr_df.index.name corr_df.head(10)

symbol AAL AAPL AMZN F FB GE GOOGL TWTR AAL 1.000000 0.235239 0.226061 0.068024 0.356063 -0.332681 0.494075 0.169487 AAPL 0.235239 1.000000 0.868763 0.184501 0.911380 -0.895210 0.903191 0.781755 AMZN 0.226061 0.868763 1.000000 0.108351 0.744732 -0.937415 0.864455 0.955373 F 0.068024 0.184501 0.108351 1.000000 0.206055 -0.216064 0.189753 0.161078 FB 0.356063 0.911380 0.744732 0.206055 1.000000 -0.814703 0.900033 0.650404 GE -0.332681 -0.895210 -0.937415 -0.216064 -0.814703 1.000000 -0.882526 -0.866871 GOOGL 0.494075 0.903191 0.864455 0.189753 0.900033 -0.882526 1.000000 0.789379 TWTR 0.169487 0.781755 0.955373 0.161078 0.650404 -0.866871 0.789379 1.000000

5) Finally, we can plot a heatmap of the correlations (with Seaborn and Matplotlib) to better visualize the results:

#take the bottom triangle since it repeats itself mask = np.zeros_like(corr_df) mask[np.triu_indices_from(mask)] = True #generate plot seaborn.heatmap(corr_df, cmap='RdYlGn', vmax=1.0, vmin=-1.0 , mask = mask, linewidths=2.5) plt.yticks(rotation=0) plt.xticks(rotation=90) plt.show()

Here we can see that, as expected, the tech companies are generally pretty highly correlated (as indicated by dark green -- AMZN/GOOGL, FB/AAPL, etc). Conversely, Ford (F) and General Electric (GE) are either not correlated or negatively correlated with the rest of the group.

Interested in practicing for data scientist or analyst interviews?

We send 3 questions each week to thousands of data scientists and analysts preparing for interviews or just keeping their skills sharp. You can sign up to receive the questions for free on our home page.