Stock Price Data Analysis – 2
Let’s start by loading the data into memory. The data consists of per-minute stock prices for the S&P 500 constituents over the 10 trading days prior to 17 Dec. With 390 trading minutes per day, that gives roughly 390 × 10 = 3,900 rows per stock (the code below allots 391 slots per day, 3,910 in total). This is not entirely reliable, however, since some of the data has holes in it – i.e. some stocks are missing some minute values. This needs to be taken care of after loading the data.
Data sanitisation can sometimes take up a large part of the time one spends on data analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append('../data_libs/')
from data_libs import correlate
# render plots inline in Jupyter
%matplotlib inline
import os
#the list of stocks
stocks = ["MMM","ABT","ABBV","ACN","ATVI","AYI","ADBE","AAP","AES","AET","AMG","AFL","A","APD","AKAM","ALK","ALB","AA","ALXN","ALLE","AGN","ADS","LNT","ALL","GOOGL","GOOG","MO","AMZN","AEE","AAL","AEP","AXP","AIG","AMT","AWK","AMP","ABC","AME","AMGN","APH","APC","ADI","ANTM","AON","APA","AIV","AAPL","AMAT","ADM","AJG","AIZ","T","ADSK","ADP","AN","AZO","AVGO","AVB","AVY","BHI","BLL","BAC","BCR","BAX","BBT","BDX","BBBY","BRK-B","BBY","BIIB","BLK","HRB","BA","BWA","BXP","BSX","BMY","BF-B","CHRW","CA","COG","CPB","COF","CAH","KMX","CCL","CAT","CBG","CBS","CELG","CNC","CNP","CTL","CERN","CF","SCHW","CHK","CVX","CMG","CB","CHD","CI","XEC","CINF","CTAS","CSCO","C","CFG","CTXS","CME","CMS","COH","CTSH","CL","CMCSA","CMA","CAG","CXO","COP","ED","STZ","GLW","COST","CCI","CSRA","CSX","CMI","CVS","DHI","DHR","DRI","DVA","DE","DLPH","DAL","XRAY","DVN","DO","DLR","DFS","DISCA","DISCK","DG","DLTR","D","DOV","DOW","DPS","DTE","DD","DUK","DNB","ETFC","EMN","ETN","EBAY","ECL","EIX","EW","EA","EMC","EMR","ENDP","ETR","EOG","EQT","EFX","EQIX","EQR","ESS","EL","ES","EXC","EXPE","EXPD","ESRX","EXR","XOM","FFIV","FB","FAST","FRT","FDX","FIS","FITB","FSLR","FE","FISV","FLIR","FLS","FLR","FMC","FTI","FL","F","FTV","FBHS","BEN","FCX","FTR","GPS","GRMN","GD","GE","GGP","GIS","GM","GPC","GILD","GPN","GS","GT","GWW","HAL","HBI","HOG","HAR","HRS","HIG","HAS","HCA","HCP","HP","HSIC","HES","HPE","HOLX","HD","HON","HRL","HST","HPQ","HUM","HBAN","ITW","ILMN","IR","INTC","ICE","IBM","IP","IPG","IFF","INTU","ISRG","IVZ","IRM","JBHT","JEC","JNJ","JCI","JPM","JNPR","KSU","K","KEY","KMB","KIM","KMI","KLAC","KSS","KHC","KR","LB","LLL","LH","LRCX","LM","LEG","LEN","LUK","LVLT","LLY","LNC","LLTC","LKQ","LMT","L","LOW","LYB","MTB","MAC","M","MNK","MRO","MPC","MAR","MMC","MLM","MAS","MA","MAT","MKC","MCD","MCK","MJN","MDT","MRK","MET","KORS","MCHP","MU","MSFT","MHK","TAP","MDLZ","MON","MNST","MCO","MS","MSI","MUR","MYL","NDAQ","NOV","NAVI","NTAP","NFLX","NWL","NFX","NEM","NWSA","NWS","NEE","NLSN","NKE","NI","NBL","JWN","NSC","NTRS","NOC","NRG","NUE","NVDA","ORLY","OXY","OMC","OKE","ORCL","OI","PCAR","PH","PDCO","PAYX","PYPL","PNR","PBCT","PEP","PKI","PRGO","PFE","PCG","PM","PSX","PNW","PXD","PBI","PNC","RL","PPG","PPL","PX","PCLN","PFG","PG","PGR","PLD","PRU","PEG","PSA","PHM","PVH","QRVO","QCOM","PWR","DGX","RRC","RTN","O","RHT","REGN","RF","RSG","RAI","RHI","ROK","COL","ROP","ROST","RCL","R","SPGI","CRM","SCG","SLB","SNI","STX","SEE","SRE","SHW","SIG","SPG","SWKS","SLG","SJM","SNA","SO","LUV","SWN","SE","STJ","SWK","SPLS","SBUX","HOT","STT","SRCL","SYK","STI","SYMC","SYF","SYY","TROW","TGT","TEL","TGNA","TDC","TSO","TXN","TXT","BK","CLX","KO","HSY","MOS","TRV","DIS","TMO","TIF","TWX","TJX","TMK","TSS","TSCO","TDG","RIG","TRIP","FOXA","FOX","TYC","TSN","USB","UDR","ULTA","UA","UNP","UAL","UNH","UPS","URI","UTX","UHS","UNM","URBN","VFC","VLO","VAR","VTR","VRSN","VRSK","VZ","VRTX","VIAB","V","VNO","VMC","WMT","WBA","WM","WAT","WFC","HCN","WDC","WU","WRK","WY","WHR","WFM","WMB","WLTW","WEC","WYN","WYNN","XEL","XRX","XLNX","XL","XYL","YHOO","YUM","ZBH","ZION","ZTS"]
stock_pd = {}
for stock in stocks:
    path = os.getcwd() + '/../data/intraday/' + stock + '.csv'
    data = pd.read_csv(path, names=["DATE","CLOSE","HIGH","LOW","OPEN","VOLUME"], skiprows=7)
    stock_pd[stock] = data
openings = {}
for stock_name, stock_values in stock_pd.items():
    this_openings = [None] * 3910
    day = -1
    for index, mins in enumerate(list(stock_values["DATE"])):
        try:
            # rows whose DATE contains "a" carry a unix timestamp and mark a new day
            if "a" in mins:
                mins = 0
                day = day + 1
            mins = int(mins)
            # 391 slots per day (minute offsets 0..390)
            opening_index = (390 * day) + mins + day
            this_openings[opening_index] = stock_values["OPEN"][index]
        except IndexError:
            print(index)
    openings[stock_name] = this_openings
openings_df = pd.DataFrame.from_dict(openings)
What I do here is first select only the opening prices from the data, and then index them by date. The row marking a change of date contains a unix timestamp prefixed with the letter “a”; this identifies the start of a new day, and thus drives the indexing. The data we now have contains 3,910 rows, but it still has holes in it.
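The index arithmetic is easier to see with concrete numbers: each day owns 391 slots (minute offsets 0 through 390), so `(390 * day) + mins + day` is just `391 * day + mins`. A tiny sketch (the `opening_slot` helper below is mine, not part of the notebook):

```python
def opening_slot(day, minute):
    # same as (390 * day) + minute + day in the loop above
    return 391 * day + minute

print(opening_slot(0, 0))    # 0    -> first minute of day 0
print(opening_slot(2, 15))   # 797  -> minute 15 of day 2
print(opening_slot(9, 390))  # 3909 -> last slot of the 3910-row array
```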
openings_df.shape
(3910, 504)
openings_df.head()
|   | A | AA | AAL | AAP | AAPL | ABBV | ABC | ABT | ACN | ADBE | … | XLNX | XOM | XRAY | XRX | XYL | YHOO | YUM | ZBH | ZION | ZTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44.5600 | 31.30 | 46.610 | 172.380 | 109.2600 | 60.26 | 78.190 | 38.13 | 118.650 | 100.89 | … | 54.5000 | 87.63 | 57.840 | 9.3200 | 52.2000 | 39.9800 | 63.42 | 104.230 | 41.88 | 51.20 |
| 1 | 44.7400 | 31.35 | 46.640 | 173.215 | 109.2800 | 60.19 | 77.035 | 38.10 | 118.615 | 101.40 | … | 54.4900 | 87.67 | NaN | 9.3500 | 52.1388 | 39.8600 | 63.38 | 104.430 | 41.88 | 50.14 |
| 2 | 44.5900 | 31.38 | 46.600 | NaN | 109.3700 | 60.28 | 77.045 | 38.14 | 118.600 | 101.46 | … | 54.5600 | 87.55 | 58.145 | 9.4200 | 52.1900 | 39.9500 | 63.45 | 104.795 | 41.72 | 50.10 |
| 3 | 44.5100 | 31.50 | 46.710 | NaN | 109.6905 | 60.24 | 76.550 | 38.02 | 118.760 | 101.67 | … | 54.5093 | 87.40 | 58.160 | 9.4300 | 52.2900 | 40.0600 | 63.65 | 105.160 | 41.73 | 50.12 |
| 4 | 44.5864 | 31.70 | 46.785 | NaN | 109.7700 | 60.17 | 76.660 | 37.94 | 118.790 | 101.78 | … | 54.6600 | 87.58 | 58.050 | 9.4745 | 52.2000 | 40.1198 | 63.70 | 104.650 | 41.75 | 50.08 |
5 rows × 504 columns
openings_df = openings_df.interpolate()
We fill the holes (the NaN values) by interpolating the data. This will make our next steps less painful.
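To make it concrete, here is what `interpolate` does to a small series with holes (toy numbers, not the real data):

```python
import numpy as np
import pandas as pd

# a short price series with two missing minutes
prices = pd.Series([44.56, np.nan, 44.59, np.nan, 44.58])

# linear interpolation fills each NaN from its neighbours
filled = prices.interpolate()
print(filled.tolist())  # [44.56, ~44.575, 44.59, ~44.585, 44.58]
```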
Let’s look at the correlated stocks.
openings_df = openings_df.dropna(axis=1)
stocks_to_compare = list(openings_df.columns)
stocks_to_compare.remove('AAPL')
apple_correlations = correlate.correlate(openings_df,"AAPL", stocks_to_compare)
apple_most_corr = apple_correlations.sort_values()[-15:]
print(apple_most_corr)
XRAY 0.908866
SYY 0.916308
SYK 0.917560
KHC 0.919165
BSX 0.920447
FISV 0.920813
EXC 0.923402
CTXS 0.926579
AMT 0.928038
CELG 0.930203
INTC 0.936103
MSFT 0.939212
MCD 0.940888
BMY 0.947866
SRCL 0.950906
dtype: float64
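`correlate.correlate` is a helper from the local `data_libs` package, so its implementation is not shown here. Assuming it computes plain Pearson correlations against the target column, the same numbers can be reproduced with pandas alone (the `correlate_with` name is mine):

```python
import pandas as pd

def correlate_with(df, target, others):
    # Pearson correlation of `target` against each column in `others`;
    # a plain-pandas stand-in for correlate.correlate (assumed behaviour)
    return df[others].corrwith(df[target])

# toy frame: B moves with A, C moves against it
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": [2.0, 4.0, 6.0, 8.0],
                   "C": [4.0, 3.0, 2.0, 1.0]})
corrs = correlate_with(df, "A", ["B", "C"])
print(corrs["B"], corrs["C"])  # ~1.0 and ~-1.0
```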
plt.plot(openings_df["AAPL"], color="RED")
plt.plot(openings_df["BMY"], color="BLUE")
plt.plot(openings_df["SRCL"], color="BLUE")
plt.plot(openings_df["MCD"], color="BLUE")
plt.plot(correlate.moving_corrs(openings_df, "AAPL", "BMY", 90))
plt.plot(correlate.moving_corrs(openings_df, "AAPL", "SRCL", 390))
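`correlate.moving_corrs` is likewise a local helper. Assuming it computes a rolling-window correlation, pandas’ `rolling(...).corr(...)` gives the same thing (the `moving_corrs` sketch below is my assumption, not the original code):

```python
import pandas as pd

def moving_corrs(df, a, b, window):
    # rolling Pearson correlation of two columns over `window` minutes
    return df[a].rolling(window).corr(df[b])

df = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "Y": [2.0, 4.0, 6.0, 8.0, 10.0]})
rc = moving_corrs(df, "X", "Y", 3)
# the first complete window ends at index 2; every full window here gives ~1.0
```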
This data is very granular, and prices do not change much from one minute to the next. However, stock price momentum is often used by traders for intraday trading. Let’s see if we can predict AAPL’s price change based on its momentum (rate of change).
from data_libs import classifier
apple_rise_fall = classifier.rise_fall(openings_df["AAPL"])
apple_percentage_change = classifier.percentage_change(openings_df["AAPL"])
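`classifier.rise_fall` and `classifier.percentage_change` come from the same local `data_libs` package. A plausible sketch of what they compute (my guess, not the notebook’s actual code):

```python
import pandas as pd

def rise_fall(prices):
    # 1 where the price rose from the previous minute, 0 otherwise
    # (assumed behaviour of classifier.rise_fall)
    return (prices.diff() > 0).astype(int).tolist()

def percentage_change(prices):
    # minute-over-minute fractional change; the undefined first value becomes 0
    # (assumed behaviour of classifier.percentage_change)
    return prices.pct_change().fillna(0).tolist()

prices = pd.Series([100.0, 101.0, 100.5, 102.0])
print(rise_fall(prices))  # [0, 1, 0, 1]
```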
Let’s say we want to train a classifier which takes the past 20 minutes of price movements and predicts whether the price will rise or fall in the next minute. We need to create a dataset that pairs each rise/fall label with the preceding 20 values of the stock.
apple_twenty_minutes_movements = []
for i in range(0,len(apple_percentage_change)-20):
apple_twenty_minutes_movements.append(apple_percentage_change[i:i+20])
apple_rise_fall = apple_rise_fall[20:]
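The windowing above is easier to see on a toy series, with a window of 3 instead of 20:

```python
# each training example is a window of consecutive changes; the label paired
# with window [i, i+w) is the rise/fall flag at index i+w
changes = [0.1, 0.2, 0.3, 0.4, 0.5]
window = 3
movements = [changes[i:i + window] for i in range(len(changes) - window)]
print(movements)  # [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]
```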
from sklearn.svm import LinearSVC
model = LinearSVC()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(apple_twenty_minutes_movements, apple_rise_fall, test_size=0.3, random_state=0)
model.fit(X_train,y_train)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
model.score(X_test,y_test)
0.54498714652956293
An accuracy of 0.545 says a lot about using momentum to predict stock prices. This is nearly the same as a coin toss, and the random walk hypothesis suggests that this is exactly what we should expect.
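A score like 0.545 only means something relative to the base rate of the labels. A quick sanity check is the accuracy of always predicting the majority class (the 52/48 split below is made up for illustration):

```python
import numpy as np

def majority_baseline(labels):
    # accuracy obtained by always predicting the most common label
    labels = np.asarray(labels)
    return np.bincount(labels).max() / len(labels)

labels = [1] * 52 + [0] * 48
print(majority_baseline(labels))  # 0.52
```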
Great care should be taken when using stock-momentum intuition over short intervals; over the long run, such a strategy is unlikely to make much money.