Stock Price Data Analysis – 1
Introduction
I recently got my hands on raw stock price data (open, high, low, close and volume) for the Standard & Poor’s 500 index (S&P 500). The data covers each stock in the S&P 500 bucket (actually 504 tickers, since some companies are listed twice for different share classes) at two types of intervals:
- Daily (one row per market open date from Jan 1 to Dec 15 of 2016)
- Per minute, for each of the 390 minutes the market is open, over the 10 trading days leading up to 17 Dec ’16
This gave me an opportunity to look at different trends in the data and to see if something of use could be gleaned from it.
Daily Tickers
Let’s first import our dependencies before going any further. The module data_libs contains some helper functions; they are available on my GitHub here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append('../data_libs/')
from data_libs import correlate
# this is to plot everything inline within Jupyter
%matplotlib inline
import os
# the list of S&P 500 tickers
stocks=["MMM","ABT","ABBV","ACN","ATVI","AYI","ADBE","AAP","AES","AET","AMG","AFL","A","APD","AKAM","ALK","ALB","AA","ALXN","ALLE","AGN","ADS","LNT","ALL","GOOGL","GOOG","MO","AMZN","AEE","AAL","AEP","AXP","AIG","AMT","AWK","AMP","ABC","AME","AMGN","APH","APC","ADI","ANTM","AON","APA","AIV","AAPL","AMAT","ADM","AJG","AIZ","T","ADSK","ADP","AN","AZO","AVGO","AVB","AVY","BHI","BLL","BAC","BCR","BAX","BBT","BDX","BBBY","BRK-B","BBY","BIIB","BLK","HRB","BA","BWA","BXP","BSX","BMY","BF-B","CHRW","CA","COG","CPB","COF","CAH","KMX","CCL","CAT","CBG","CBS","CELG","CNC","CNP","CTL","CERN","CF","SCHW","CHK","CVX","CMG","CB","CHD","CI","XEC","CINF","CTAS","CSCO","C","CFG","CTXS","CME","CMS","COH","CTSH","CL","CMCSA","CMA","CAG","CXO","COP","ED","STZ","GLW","COST","CCI","CSRA","CSX","CMI","CVS","DHI","DHR","DRI","DVA","DE","DLPH","DAL","XRAY","DVN","DO","DLR","DFS","DISCA","DISCK","DG","DLTR","D","DOV","DOW","DPS","DTE","DD","DUK","DNB","ETFC","EMN","ETN","EBAY","ECL","EIX","EW","EA","EMC","EMR","ENDP","ETR","EOG","EQT","EFX","EQIX","EQR","ESS","EL","ES","EXC","EXPE","EXPD","ESRX","EXR","XOM","FFIV","FB","FAST","FRT","FDX","FIS","FITB","FSLR","FE","FISV","FLIR","FLS","FLR","FMC","FTI","FL","F","FTV","FBHS","BEN","FCX","FTR","GPS","GRMN","GD","GE","GGP","GIS","GM","GPC","GILD","GPN","GS","GT","GWW","HAL","HBI","HOG","HAR","HRS","HIG","HAS","HCA","HCP","HP","HSIC","HES","HPE","HOLX","HD","HON","HRL","HST","HPQ","HUM","HBAN","ITW","ILMN","IR","INTC","ICE","IBM","IP","IPG","IFF","INTU","ISRG","IVZ","IRM","JBHT","JEC","JNJ","JCI","JPM","JNPR","KSU","K","KEY","KMB","KIM","KMI","KLAC","KSS","KHC","KR","LB","LLL","LH","LRCX","LM","LEG","LEN","LUK","LVLT","LLY","LNC","LLTC","LKQ","LMT","L","LOW","LYB","MTB","MAC","M","MNK","MRO","MPC","MAR","MMC","MLM","MAS","MA","MAT","MKC","MCD","MCK","MJN","MDT","MRK","MET","KORS","MCHP","MU","MSFT","MHK","TAP","MDLZ","MON","MNST","MCO","MS","MSI","MUR","MYL","NDAQ","NOV","NAVI","NTAP","NFLX","NWL","NFX","NEM","NWSA","NWS","NEE","NLSN","NKE","NI","NBL","JWN","NSC","NTRS","NOC","NRG","NUE","NVDA","ORLY","OXY","OMC","OKE","ORCL","OI","PCAR","PH","PDCO","PAYX","PYPL","PNR","PBCT","PEP","PKI","PRGO","PFE","PCG","PM","PSX","PNW","PXD","PBI","PNC","RL","PPG","PPL","PX","PCLN","PFG","PG","PGR","PLD","PRU","PEG","PSA","PHM","PVH","QRVO","QCOM","PWR","DGX","RRC","RTN","O","RHT","REGN","RF","RSG","RAI","RHI","ROK","COL","ROP","ROST","RCL","R","SPGI","CRM","SCG","SLB","SNI","STX","SEE","SRE","SHW","SIG","SPG","SWKS","SLG","SJM","SNA","SO","LUV","SWN","SE","STJ","SWK","SPLS","SBUX","HOT","STT","SRCL","SYK","STI","SYMC","SYF","SYY","TROW","TGT","TEL","TGNA","TDC","TSO","TXN","TXT","BK","CLX","KO","HSY","MOS","TRV","DIS","TMO","TIF","TWX","TJX","TMK","TSS","TSCO","TDG","RIG","TRIP","FOXA","FOX","TYC","TSN","USB","UDR","ULTA","UA","UNP","UAL","UNH","UPS","URI","UTX","UHS","UNM","URBN","VFC","VLO","VAR","VTR","VRSN","VRSK","VZ","VRTX","VIAB","V","VNO","VMC","WMT","WBA","WM","WAT","WFC","HCN","WDC","WU","WRK","WY","WHR","WFM","WMB","WLTW","WEC","WYN","WYNN","XEL","XRX","XLNX","XL","XYL","YHOO","YUM","ZBH","ZION","ZTS"]
stock_pd = {}
for stock in stocks:
    # read each ticker's daily CSV into its own DataFrame
    path = os.getcwd() + '/../data/' + stock + '.csv'
    data = pd.read_csv(path)
    stock_pd[stock] = data
Now that we have all the CSV data in memory, it needs some sanitisation. We know there were 242 trading days this year up to 15 Dec, and we are only interested in the opening prices, so we select just those and add them to a new DataFrame. A DataFrame is useful since we can perform various functions on it without much effort.
openings = {}
for key, val in stock_pd.items():
    # keep only the stocks that traded on all 242 days
    if len(list(val['Open'])) == 242:
        openings[key] = list(val['Open'])
stocks_df = pd.DataFrame.from_dict(openings)
stocks = list(stocks_df.columns.values)
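As an aside, the same filtering can be written more compactly as a dict comprehension, equivalent to the loop above:
# keep only the stocks with a full 242 trading days of opening prices
openings = {key: list(val['Open']) for key, val in stock_pd.items()
            if len(val['Open']) == 242}
stocks_df = pd.DataFrame.from_dict(openings)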
Let’s look at some basic things:
stocks_df.shape
(242, 497)
stocks_df.describe()
 | A | AAL | AAP | AAPL | ABBV | ABC | ABT | ACN | ADBE | ADI | … | XLNX | XOM | XRAY | XRX | XYL | YHOO | YUM | ZBH | ZION | ZTS
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | … | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 | 242.000000 |
mean | 43.225248 | 37.728760 | 154.684711 | 104.012686 | 60.528388 | 83.420124 | 40.572190 | 112.493099 | 96.978636 | 59.616570 | … | 48.942231 | 85.933471 | 60.299876 | 9.791860 | 44.808512 | 37.530950 | 79.313886 | 113.993141 | 28.066364 | 47.603017 |
std | 3.659674 | 4.671148 | 9.182969 | 7.329680 | 3.666874 | 6.205344 | 2.238767 | 6.842154 | 7.806769 | 5.806182 | … | 3.527123 | 4.654318 | 2.129507 | 0.545279 | 5.778593 | 4.472494 | 9.337674 | 12.127044 | 5.055868 | 3.604124 |
min | 34.209999 | 26.120001 | 134.119995 | 90.000000 | 51.320000 | 69.129997 | 36.169998 | 92.540001 | 74.040001 | 48.160000 | … | 41.049999 | 73.160004 | 54.049999 | 8.720000 | 32.439999 | 26.459999 | 59.889999 | 88.449997 | 20.100000 | 38.980000 |
25% | 39.892500 | 34.992502 | 147.997506 | 97.177498 | 57.549999 | 78.437498 | 38.804999 | 110.382497 | 92.740001 | 55.694999 | … | 46.565001 | 83.300003 | 59.077500 | 9.440000 | 40.992502 | 35.822501 | 71.312502 | 102.287498 | 24.672500 | 45.932501 |
50% | 44.645001 | 37.825001 | 154.165001 | 104.849998 | 60.959999 | 84.654999 | 40.280000 | 114.340001 | 97.215000 | 58.780000 | … | 48.020001 | 86.834999 | 60.320002 | 9.720000 | 45.760000 | 37.470001 | 81.700003 | 115.960003 | 27.144999 | 48.030001 |
75% | 46.307501 | 40.767499 | 161.020004 | 109.862497 | 63.382500 | 87.537502 | 42.250000 | 117.080002 | 102.634999 | 63.585000 | … | 51.980002 | 88.637499 | 61.707500 | 9.967500 | 49.997501 | 41.439999 | 87.507503 | 125.337498 | 30.587501 | 50.675000 |
max | 48.630001 | 49.270000 | 176.410004 | 118.180000 | 67.470001 | 102.309998 | 45.720001 | 125.250000 | 110.760002 | 74.540001 | … | 58.290001 | 95.440002 | 64.889999 | 11.270000 | 54.730000 | 44.549999 | 91.590006 | 132.309998 | 43.160000 | 52.430000 |
8 rows × 497 columns
So we have 242 rows of opening prices of 497 stocks. Let’s select one stock and do some analysis.
Apple (AAPL)
Let’s take a look at the correlations of AAPL with the other stocks.
# copy the stocks list so we can drop Apple without modifying the full list
stocks_to_compare = list(stocks)
stocks_to_compare.remove('AAPL')
apple_correlations = correlate.correlate(stocks_df, "AAPL", stocks_to_compare)
apple_correlations[:10]
A 0.426812
AAL 0.508813
AAP 0.129514
ABBV 0.307767
ABC -0.067380
ABT 0.449754
ACN 0.436478
ADBE 0.713172
ADI 0.738866
ADM 0.470422
dtype: float64
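As a point of reference, here is a minimal sketch of what a helper like correlate.correlate could look like — this assumes it simply computes the Pearson correlation of the target column against each of the others, which pandas supports directly:
# hypothetical stand-in for data_libs' correlate.correlate
def correlate_sketch(df, target, others):
    # Pearson correlation of each column in `others` with the `target` column
    return df[others].corrwith(df[target])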
It would be more interesting to see the stocks most correlated with AAPL.
# take the 15 stocks with the highest correlation
apple_most_corr = apple_correlations.sort_values().tail(15)
print(apple_most_corr)
STX 0.753511
URBN 0.762356
LLTC 0.763457
QCOM 0.765870
LRCX 0.768761
HOLX 0.770554
ADSK 0.777759
INTC 0.787496
ROST 0.796642
GLW 0.804015
NTAP 0.829960
MSFT 0.858131
SWKS 0.870930
GOOG 0.882116
GOOGL 0.890361
dtype: float64
plt.plot(list(apple_most_corr))
# rotate the labels so all 15 tickers stay readable
plt.xticks(np.arange(15), list(apple_most_corr.index), rotation=90)
These are the 15 most correlated stocks to Apple over the past year. GOOGL (Google) seems to be very strongly correlated with Apple, as does MSFT (Microsoft). There are, however, a few unusual ones, such as ROST (Ross Stores, a department store chain) and URBN (Urban Outfitters). Tech companies are usually correlated as a group and show similar movements over the long run, but seeing URBN and ROST here is a bit strange. Let’s take a deeper look.
twenty_day_moving_corr = correlate.moving_corrs(stocks_df, "AAPL", "ROST", 20)
plt.plot(list(twenty_day_moving_corr))
As we can see, the correlation between AAPL and ROST varies wildly over a 20-day moving window.
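If you don’t have the data_libs helpers, pandas can compute the same kind of rolling correlation directly — a sketch, assuming moving_corrs does the straightforward thing:
# 20-day rolling Pearson correlation between the two opening-price series
# (the first 19 values are NaN until the window fills up)
rolling_corr = stocks_df["AAPL"].rolling(window=20).corr(stocks_df["ROST"])
plt.plot(rolling_corr)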
twenty_day_moving_corr = correlate.moving_corrs(stocks_df, "AAPL", "URBN", 20)
plt.plot(list(twenty_day_moving_corr))
URBN is a bit different. It flipped its relationship about 40 days into the year, and its positive correlation with AAPL has grown since. Let’s take a look at their stock price graphs.
plt.plot(stocks_df["URBN"])
plt.plot(stocks_df["AAPL"])
AAPL is in green and URBN in black. As we can see, there are some similar movements in the prices. There could be another variable affecting both of these stocks, or the similarity in movement could be purely coincidental; it is difficult to say.
Regression Model – Reducing Predictors
Let’s assume we can predict AAPL’s stock price on day n+1 based on the rest of the stocks’ prices on day n.
We currently have far too many predictors in our DataFrame: 496 other stocks but only 242 daily observations, so a plain linear regression on all of them would badly overfit. If we want a model that predicts the price of AAPL stock from the rest of the stocks, we first need to reduce the number of predictors.
from sklearn.decomposition import PCA
stocks_df_without_apple = stocks_df.copy()
del stocks_df_without_apple["AAPL"]
pca = PCA(n_components=3)
pca.fit(stocks_df_without_apple)
PCA(copy=True, n_components=3, whiten=False)
print(pca.explained_variance_ratio_)
[ 0.63283094 0.14793351 0.06302373]
We now have three principal components. Interpreting these components properly deserves a post of its own, but the idea is to reduce the dimensionality of the data such that the resulting dimensions are orthogonal (i.e. they have zero correlation with each other).
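Summing the ratios shows how much of the total variance the three components capture together:
print(pca.explained_variance_ratio_.sum())
# ≈ 0.844 — the three components capture about 84% of the total variance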
Let’s transform our data using the PCA which we can then do a regression on.
stocks_pca = pca.transform(stocks_df_without_apple)
stocks_pca.shape
(242, 3)
Let’s now shift the data to get the one-day lag we need: the principal components from day n paired against AAPL’s opening price on day n+1.
from data_libs import lag
lagged_stocks_pca = lag.shift(stocks_pca, -1)
lagged_apple = lag.shift(stocks_df["AAPL"],1)
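For readers without the data_libs helpers, the same alignment can be done by hand — pair each day’s components with the next day’s AAPL open (a sketch of the intended lag, not necessarily what lag.shift does internally):
X_lagged = stocks_pca[:-1]               # principal components on day n
y_lagged = stocks_df["AAPL"].values[1:]  # AAPL opening price on day n+1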
from sklearn.linear_model import LinearRegression as LR
model = LR()
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
X_train, X_test, y_train, y_test = train_test_split(lagged_stocks_pca, lagged_apple, test_size=0.3, random_state=0)
model.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
model.score(X_test, y_test)
0.65330220629496782
The model score gives the R-squared of the model. An R-squared of 0.65 means that our model explains 65% of the variance in the price of Apple stock, which is pretty dismal. Let’s take a look at the root mean squared error of the predictions on the test set.
from sklearn import metrics
np.sqrt(metrics.mean_squared_error(y_test,model.predict(X_test)))
4.3187068988897206
This means our predicted values are off by about $4 from the actual stock prices on average. For context, AAPL’s opening price has a standard deviation of about $7.3 (see the describe table above), so the model beats naively predicting the mean, but not by much; it is of little practical use.
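One caveat worth flagging: train_test_split shuffles the rows by default, so the model gets to train on days that come after some of its test days. For time-series data a chronological split is the safer choice — a sketch:
# chronological split: train on the first 70% of days, test on the rest
split = int(len(lagged_stocks_pca) * 0.7)
X_train, X_test = lagged_stocks_pca[:split], lagged_stocks_pca[split:]
y_train, y_test = lagged_apple[:split], lagged_apple[split:]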
plt.plot(lagged_apple, color="RED")
plt.plot(model.predict(lagged_stocks_pca),color="BLACK")
As can be seen here, our model does predict the general trend correctly, but it is still not accurate. This is probably because we have very few data points to work with. Let’s take a look at the minute ticker data in the next post.