Stock Price Data Analysis – 1

By taha No comments

Introduction

I recently was able to get my hands on raw stock prices data (open, high, low, close and volume) for the Standard And Poor’s 500 index (SNP500). The data was for each stock in the SNP500 bucket (actually 504 stocks since some of the stocks are listed twice for different types of shares issued) at two types of intervals:

  1. Daily (from each market open date since Jan 1 to Dec 15 of 2016)
  2. Per minute for each of the 390 minutes the market is open for the past 10 days since 17 Dec ’16.

This gave me an opportunity to look at different trends in the data and to see if something of use could be gleaned from it.

Daily Tickers

Let’s first import our dependencies before going any further. The module data_libs contains some helper functions. They are available on my github here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append('../data_libs/')
from data_libs import correlate
#this is to plot everything within jupyter
%matplotlib inline

import os
#the list of stocks
stocks=["MMM","ABT","ABBV","ACN","ATVI","AYI","ADBE","AAP","AES","AET","AMG","AFL","A","APD","AKAM","ALK","ALB","AA","ALXN","ALLE","AGN","ADS","LNT","ALL","GOOGL","GOOG","MO","AMZN","AEE","AAL","AEP","AXP","AIG","AMT","AWK","AMP","ABC","AME","AMGN","APH","APC","ADI","ANTM","AON","APA","AIV","AAPL","AMAT","ADM","AJG","AIZ","T","ADSK","ADP","AN","AZO","AVGO","AVB","AVY","BHI","BLL","BAC","BCR","BAX","BBT","BDX","BBBY","BRK-B","BBY","BIIB","BLK","HRB","BA","BWA","BXP","BSX","BMY","BF-B","CHRW","CA","COG","CPB","COF","CAH","KMX","CCL","CAT","CBG","CBS","CELG","CNC","CNP","CTL","CERN","CF","SCHW","CHK","CVX","CMG","CB","CHD","CI","XEC","CINF","CTAS","CSCO","C","CFG","CTXS","CME","CMS","COH","CTSH","CL","CMCSA","CMA","CAG","CXO","COP","ED","STZ","GLW","COST","CCI","CSRA","CSX","CMI","CVS","DHI","DHR","DRI","DVA","DE","DLPH","DAL","XRAY","DVN","DO","DLR","DFS","DISCA","DISCK","DG","DLTR","D","DOV","DOW","DPS","DTE","DD","DUK","DNB","ETFC","EMN","ETN","EBAY","ECL","EIX","EW","EA","EMC","EMR","ENDP","ETR","EOG","EQT","EFX","EQIX","EQR","ESS","EL","ES","EXC","EXPE","EXPD","ESRX","EXR","XOM","FFIV","FB","FAST","FRT","FDX","FIS","FITB","FSLR","FE","FISV","FLIR","FLS","FLR","FMC","FTI","FL","F","FTV","FBHS","BEN","FCX","FTR","GPS","GRMN","GD","GE","GGP","GIS","GM","GPC","GILD","GPN","GS","GT","GWW","HAL","HBI","HOG","HAR","HRS","HIG","HAS","HCA","HCP","HP","HSIC","HES","HPE","HOLX","HD","HON","HRL","HST","HPQ","HUM","HBAN","ITW","ILMN","IR","INTC","ICE","IBM","IP","IPG","IFF","INTU","ISRG","IVZ","IRM","JBHT","JEC","JNJ","JCI","JPM","JNPR","KSU","K","KEY","KMB","KIM","KMI","KLAC","KSS","KHC","KR","LB","LLL","LH","LRCX","LM","LEG","LEN","LUK","LVLT","LLY","LNC","LLTC","LKQ","LMT","L","LOW","LYB","MTB","MAC","M","MNK","MRO","MPC","MAR","MMC","MLM","MAS","MA","MAT","MKC","MCD","MCK","MJN","MDT","MRK","MET","KORS","MCHP","MU","MSFT","MHK","TAP","MDLZ","MON","MNST","MCO","MS","MSI","MUR","MYL","NDAQ","NOV","NAVI","NTAP","NFLX","NWL","NFX","NEM","NWSA","NWS","NEE","NLSN","NKE","NI","NBL","JWN","NSC","NTRS","NOC","NRG","NUE","NVDA","ORLY","OXY","OMC","OKE","ORCL","OI","PCAR","PH","PDCO","PAYX","PYPL","PNR","PBCT","PEP","PKI","PRGO","PFE","PCG","PM","PSX","PNW","PXD","PBI","PNC","RL","PPG","PPL","PX","PCLN","PFG","PG","PGR","PLD","PRU","PEG","PSA","PHM","PVH","QRVO","QCOM","PWR","DGX","RRC","RTN","O","RHT","REGN","RF","RSG","RAI","RHI","ROK","COL","ROP","ROST","RCL","R","SPGI","CRM","SCG","SLB","SNI","STX","SEE","SRE","SHW","SIG","SPG","SWKS","SLG","SJM","SNA","SO","LUV","SWN","SE","STJ","SWK","SPLS","SBUX","HOT","STT","SRCL","SYK","STI","SYMC","SYF","SYY","TROW","TGT","TEL","TGNA","TDC","TSO","TXN","TXT","BK","CLX","KO","HSY","MOS","TRV","DIS","TMO","TIF","TWX","TJX","TMK","TSS","TSCO","TDG","RIG","TRIP","FOXA","FOX","TYC","TSN","USB","UDR","ULTA","UA","UNP","UAL","UNH","UPS","URI","UTX","UHS","UNM","URBN","VFC","VLO","VAR","VTR","VRSN","VRSK","VZ","VRTX","VIAB","V","VNO","VMC","WMT","WBA","WM","WAT","WFC","HCN","WDC","WU","WRK","WY","WHR","WFM","WMB","WLTW","WEC","WYN","WYNN","XEL","XRX","XLNX","XL","XYL","YHOO","YUM","ZBH","ZION","ZTS"]
stock_pd = {}
for stock in stocks:
    path = os.getcwd() + '/../data/'+stock+'.csv'
    data = pd.read_csv(path)
    stock_pd[stock] = data

Now that we have all the csv data in memory, it needs some sanitisation. We know there were 242 trading days this year until 15 Dec, and we are only interested in the opening prices, so we only select those ones and add them to our new DataFrame. A DataFrame is useful since we can perform various funtions on it without much effort.

openings = {}
for key, val in stock_pd.items():
    if len(list(val['Open'])) == 242:
        openings[key]=list(val['Open'])
stocks_df = pd.DataFrame.from_dict(openings)
stocks = list(stocks_df.columns.values)

Let’s look at some basic things:

stocks_df.shape
(242, 497)
stocks_df.describe()
A AAL AAP AAPL ABBV ABC ABT ACN ADBE ADI XLNX XOM XRAY XRX XYL YHOO YUM ZBH ZION ZTS
count 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000 242.000000
mean 43.225248 37.728760 154.684711 104.012686 60.528388 83.420124 40.572190 112.493099 96.978636 59.616570 48.942231 85.933471 60.299876 9.791860 44.808512 37.530950 79.313886 113.993141 28.066364 47.603017
std 3.659674 4.671148 9.182969 7.329680 3.666874 6.205344 2.238767 6.842154 7.806769 5.806182 3.527123 4.654318 2.129507 0.545279 5.778593 4.472494 9.337674 12.127044 5.055868 3.604124
min 34.209999 26.120001 134.119995 90.000000 51.320000 69.129997 36.169998 92.540001 74.040001 48.160000 41.049999 73.160004 54.049999 8.720000 32.439999 26.459999 59.889999 88.449997 20.100000 38.980000
25% 39.892500 34.992502 147.997506 97.177498 57.549999 78.437498 38.804999 110.382497 92.740001 55.694999 46.565001 83.300003 59.077500 9.440000 40.992502 35.822501 71.312502 102.287498 24.672500 45.932501
50% 44.645001 37.825001 154.165001 104.849998 60.959999 84.654999 40.280000 114.340001 97.215000 58.780000 48.020001 86.834999 60.320002 9.720000 45.760000 37.470001 81.700003 115.960003 27.144999 48.030001
75% 46.307501 40.767499 161.020004 109.862497 63.382500 87.537502 42.250000 117.080002 102.634999 63.585000 51.980002 88.637499 61.707500 9.967500 49.997501 41.439999 87.507503 125.337498 30.587501 50.675000
max 48.630001 49.270000 176.410004 118.180000 67.470001 102.309998 45.720001 125.250000 110.760002 74.540001 58.290001 95.440002 64.889999 11.270000 54.730000 44.549999 91.590006 132.309998 43.160000 52.430000

8 rows × 497 columns

So we have 242 rows of opening prices of 497 stocks. Let’s select one stock and do some analysis.

Apple (AAPL)

Let’s take a look at the correlations of APPL with the other stocks.

#first ten correlated stocks - we copy the stocks list to remove apple without modifying the all stocks list we have
stocks_to_compare = list(stocks)
stocks_to_compare.remove('AAPL')
apple_correlations = correlate.correlate(stocks_df,"AAPL", stocks_to_compare)
apple_correlations[:10]
A       0.426812
AAL     0.508813
AAP     0.129514
ABBV    0.307767
ABC    -0.067380
ABT     0.449754
ACN     0.436478
ADBE    0.713172
ADI     0.738866
ADM     0.470422
dtype: float64

It would more interesting to see the stocks most correlated to AAPL.

apple_most_corr = apple_correlations.sort_values()[len(apple_correlations)-15:]
print (apple_most_corr)
STX      0.753511
URBN     0.762356
LLTC     0.763457
QCOM     0.765870
LRCX     0.768761
HOLX     0.770554
ADSK     0.777759
INTC     0.787496
ROST     0.796642
GLW      0.804015
NTAP     0.829960
MSFT     0.858131
SWKS     0.870930
GOOG     0.882116
GOOGL    0.890361
dtype: float64
plt.plot(list(apple_most_corr))
plt.xticks(np.arange(15),list(apple_most_corr.index))
png

These are the 10 mose correlated stocks to Apple over the past year. GOOGL (google) seems to be very much correlated to Apple, so does MSFT (microsoft). There are, however, a few unusual ones, such as ROST (Ross Stores which is a departmental stores brand, and URBN (Urban Outfitters). Usually, tech companies are correlated as a whole, and have similar movements over the long run. But seeing URBN and ROST is a bit strange. Let’s take a deeper look.

twenty_day_moving_corr = correlate.moving_corrs(stocks_df, "AAPL", "ROST", 20)
plt.plot(list(twenty_day_moving_corr))
png

As we can see the correlation for ROST varies wildly over a 20 day moving correlation window.

twenty_day_moving_corr = correlate.moving_corrs(stocks_df, "AAPL", "URBN", 20)
plt.plot(list(twenty_day_moving_corr))
png

URBN is a bit different. It switched it’s relationship about 40 days into the year, and rose in the positive relationship with APPL since. Let’s take a look at their stock price graphs.

plt.plot(stocks_df["URBN"])
plt.plot(stocks_df["AAPL"])
png

AAPL in green and URBN in black. As we can see there is some similar movements in the prices. There could be another variable which might effect both of these, or the similarity in movement could be purely coincidental. It is difficult to say.

Regression Model – Reducing Predictors

Let’s assume we can predict AAPL’s day n+1 stock prices based on the rest of the stock’s prices on day n.

We currently have way too many predictors in our dataframe. If we want to make a model to predict the price of AAPL stock based on the rest of the stocks, we need to reduce the number of predictors.

from sklearn.decomposition import PCA
stocks_df_without_apple = stocks_df.copy()
del stocks_df_without_apple["AAPL"]
pca = PCA(n_components=3)
pca.fit(stocks_df_without_apple)
PCA(copy=True, n_components=3, whiten=False)
print(pca.explained_variance_ratio_) 
[ 0.63283094  0.14793351  0.06302373]

We now have three principle components. Interpreting these components requires a different post, but the idea is to reduce dimensions of the data in a way that the resulting dimensions are orthogonal (i.e. they have zero correlation with each other).

Let’s transform our data using the PCA which we can then do a regression on.

stocks_pca = pca.transform(stocks_df_without_apple)
stocks_pca.shape
(242, 3)

Let’s first normalise the data to get the timeshift we need.

from data_libs import lag
lagged_stocks_pca = lag.shift(stocks_pca, -1)
lagged_apple = lag.shift(stocks_df["AAPL"],1)
from sklearn.linear_model import LinearRegression as LR
model = LR()
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(lagged_stocks_pca, lagged_apple, test_size=0.3, random_state=0)
model.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
model.score(X_test, y_test)
0.65330220629496782

The model score gives the R squared of the model. An R squared of 0.65 means that our model explains 65% of the variance in the price of Apple stock. This is pretty dismal. Let’s take a look at the root mean squared error for the predictions in the test set.

from sklearn import metrics
np.sqrt(metrics.mean_squared_error(y_test,model.predict(X_test)))
4.3187068988897206

This means our predicted values vary about $4 from the actual stock prices. This means the model is of little use.

plt.plot(lagged_apple, color="RED")
plt.plot(model.predict(lagged_stocks_pca),color="BLACK")
png

As can be seen here, our model does predict the general trend correctly, but it is still not accurate. This is probably because we have very little data points that we are working with. Let’s take a look at the minute ticker data in the next post.