Stock Price Data Analysis – 2

By taha No comments

Let’s start by loading the data into memory first. The data is per minute stock prices for the SNP500 bucket for the 10 days prior to 17 Dec. This means we have 390 x 10 so about 3900 rows per stock. This, however, is not absolutely true since some data has holes in it – i.e. some stocks are missing some minute values. This needs to be taken care of after loading in the data.

Data sanitisation can sometimes take a large part of the time one spends on data-analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append('../data_libs/')
from data_libs import correlate
#this is to plot everything within jupyter
%matplotlib inline

import os

#the list of stocks
stocks=["MMM","ABT","ABBV","ACN","ATVI","AYI","ADBE","AAP","AES","AET","AMG","AFL","A","APD","AKAM","ALK","ALB","AA","ALXN","ALLE","AGN","ADS","LNT","ALL","GOOGL","GOOG","MO","AMZN","AEE","AAL","AEP","AXP","AIG","AMT","AWK","AMP","ABC","AME","AMGN","APH","APC","ADI","ANTM","AON","APA","AIV","AAPL","AMAT","ADM","AJG","AIZ","T","ADSK","ADP","AN","AZO","AVGO","AVB","AVY","BHI","BLL","BAC","BCR","BAX","BBT","BDX","BBBY","BRK-B","BBY","BIIB","BLK","HRB","BA","BWA","BXP","BSX","BMY","BF-B","CHRW","CA","COG","CPB","COF","CAH","KMX","CCL","CAT","CBG","CBS","CELG","CNC","CNP","CTL","CERN","CF","SCHW","CHK","CVX","CMG","CB","CHD","CI","XEC","CINF","CTAS","CSCO","C","CFG","CTXS","CME","CMS","COH","CTSH","CL","CMCSA","CMA","CAG","CXO","COP","ED","STZ","GLW","COST","CCI","CSRA","CSX","CMI","CVS","DHI","DHR","DRI","DVA","DE","DLPH","DAL","XRAY","DVN","DO","DLR","DFS","DISCA","DISCK","DG","DLTR","D","DOV","DOW","DPS","DTE","DD","DUK","DNB","ETFC","EMN","ETN","EBAY","ECL","EIX","EW","EA","EMC","EMR","ENDP","ETR","EOG","EQT","EFX","EQIX","EQR","ESS","EL","ES","EXC","EXPE","EXPD","ESRX","EXR","XOM","FFIV","FB","FAST","FRT","FDX","FIS","FITB","FSLR","FE","FISV","FLIR","FLS","FLR","FMC","FTI","FL","F","FTV","FBHS","BEN","FCX","FTR","GPS","GRMN","GD","GE","GGP","GIS","GM","GPC","GILD","GPN","GS","GT","GWW","HAL","HBI","HOG","HAR","HRS","HIG","HAS","HCA","HCP","HP","HSIC","HES","HPE","HOLX","HD","HON","HRL","HST","HPQ","HUM","HBAN","ITW","ILMN","IR","INTC","ICE","IBM","IP","IPG","IFF","INTU","ISRG","IVZ","IRM","JBHT","JEC","JNJ","JCI","JPM","JNPR","KSU","K","KEY","KMB","KIM","KMI","KLAC","KSS","KHC","KR","LB","LLL","LH","LRCX","LM","LEG","LEN","LUK","LVLT","LLY","LNC","LLTC","LKQ","LMT","L","LOW","LYB","MTB","MAC","M","MNK","MRO","MPC","MAR","MMC","MLM","MAS","MA","MAT","MKC","MCD","MCK","MJN","MDT","MRK","MET","KORS","MCHP","MU","MSFT","MHK","TAP","MDLZ","MON","MNST","MCO","MS","MSI","MUR","MYL","NDAQ","NOV","NAVI","NTAP","NFLX","NWL","NFX","NEM","NWSA","NWS","NEE","NLSN","NKE","NI","NBL","JWN","NSC","NTRS","NOC","NRG","NUE","NVDA","ORLY","OXY","OMC","OKE","ORCL","OI","PCAR","PH","PDCO","PAYX","PYPL","PNR","PBCT","PEP","PKI","PRGO","PFE","PCG","PM","PSX","PNW","PXD","PBI","PNC","RL","PPG","PPL","PX","PCLN","PFG","PG","PGR","PLD","PRU","PEG","PSA","PHM","PVH","QRVO","QCOM","PWR","DGX","RRC","RTN","O","RHT","REGN","RF","RSG","RAI","RHI","ROK","COL","ROP","ROST","RCL","R","SPGI","CRM","SCG","SLB","SNI","STX","SEE","SRE","SHW","SIG","SPG","SWKS","SLG","SJM","SNA","SO","LUV","SWN","SE","STJ","SWK","SPLS","SBUX","HOT","STT","SRCL","SYK","STI","SYMC","SYF","SYY","TROW","TGT","TEL","TGNA","TDC","TSO","TXN","TXT","BK","CLX","KO","HSY","MOS","TRV","DIS","TMO","TIF","TWX","TJX","TMK","TSS","TSCO","TDG","RIG","TRIP","FOXA","FOX","TYC","TSN","USB","UDR","ULTA","UA","UNP","UAL","UNH","UPS","URI","UTX","UHS","UNM","URBN","VFC","VLO","VAR","VTR","VRSN","VRSK","VZ","VRTX","VIAB","V","VNO","VMC","WMT","WBA","WM","WAT","WFC","HCN","WDC","WU","WRK","WY","WHR","WFM","WMB","WLTW","WEC","WYN","WYNN","XEL","XRX","XLNX","XL","XYL","YHOO","YUM","ZBH","ZION","ZTS"]
stock_pd = {}
for stock in stocks:
    path = os.getcwd() + '/../data/intraday/'+stock+'.csv'
    data = pd.read_csv(path, names=["DATE","CLOSE","HIGH","LOW","OPEN","VOLUME"], skiprows=7)
    stock_pd[stock] = data
openings = {}
for stock_name, stock_values in stock_pd.items():
    this_openings = [None] * 3910
    day = -1
    for index, mins in enumerate(list(stock_values["DATE"])):
        try:
            if "a" in mins:
                mins = 0
                day = day+1
            mins = int(mins)
            opening_index = (390 * day) + mins + day
            this_openings[opening_index] = stock_values["OPEN"][index]
        except IndexError:
            print (index)
    openings[stock_name] = this_openings
openings_df = pd.DataFrame.from_dict(openings)

What I do here is to first only select the opening prices in the data, and then index them by the date that we have. The row denoting a change in date contains the unix time starting with the letter “a”. This can identify a change in date, and can thus help with the indexing. The data that we now have contains 3910 rows, but also contains holes in the data.

openings_df.shape
(3910, 504)
openings_df.head()
A AA AAL AAP AAPL ABBV ABC ABT ACN ADBE XLNX XOM XRAY XRX XYL YHOO YUM ZBH ZION ZTS
0 44.5600 31.30 46.610 172.380 109.2600 60.26 78.190 38.13 118.650 100.89 54.5000 87.63 57.840 9.3200 52.2000 39.9800 63.42 104.230 41.88 51.20
1 44.7400 31.35 46.640 173.215 109.2800 60.19 77.035 38.10 118.615 101.40 54.4900 87.67 NaN 9.3500 52.1388 39.8600 63.38 104.430 41.88 50.14
2 44.5900 31.38 46.600 NaN 109.3700 60.28 77.045 38.14 118.600 101.46 54.5600 87.55 58.145 9.4200 52.1900 39.9500 63.45 104.795 41.72 50.10
3 44.5100 31.50 46.710 NaN 109.6905 60.24 76.550 38.02 118.760 101.67 54.5093 87.40 58.160 9.4300 52.2900 40.0600 63.65 105.160 41.73 50.12
4 44.5864 31.70 46.785 NaN 109.7700 60.17 76.660 37.94 118.790 101.78 54.6600 87.58 58.050 9.4745 52.2000 40.1198 63.70 104.650 41.75 50.08

5 rows × 504 columns

openings_df = openings_df.interpolate()

We fill up the holes (the Nan values) by interpolating the data. This will make our next steps less painful.

Let’s look at the correlated stocks.

openings_df = openings_df.dropna(axis=1)
stocks_to_compare = list(openings_df.columns)
stocks_to_compare.remove('AAPL')
apple_correlations = correlate.correlate(openings_df,"AAPL", stocks_to_compare)
apple_most_corr = apple_correlations.sort_values()[len(apple_correlations)-15:]
print (apple_most_corr)
XRAY    0.908866
SYY     0.916308
SYK     0.917560
KHC     0.919165
BSX     0.920447
FISV    0.920813
EXC     0.923402
CTXS    0.926579
AMT     0.928038
CELG    0.930203
INTC    0.936103
MSFT    0.939212
MCD     0.940888
BMY     0.947866
SRCL    0.950906
dtype: float64
plt.plot(openings_df["AAPL"], color="RED")
plt.plot(openings_df["BMY"], color="BLUE")
plt.plot(openings_df["SRCL"], color="BLUE")
plt.plot(openings_df["MCD"], color="BLUE")
png
plt.plot(correlate.moving_corrs(openings_df, "AAPL", "BMY", 90))
png
plt.plot(correlate.moving_corrs(openings_df, "AAPL", "SRCL", 390))
png

This data is very granular, and is not susceptible to change as much. However, stock price momentum is often used by traders for intraday trading. Let’s see if we can try to predict APPL’s price change based on it’s momentum (rate of change).

from data_libs import classifier
apple_rise_fall = classifier.rise_fall(openings_df["AAPL"])
apple_percentage_change = classifier.percentage_change(openings_df["AAPL"])

Let’s say we want to train a regressor which will take into account the past 20 minute movements of the stock price and predict whether it will rise or fall in the next minute. We need to create a new tuple set which maps rise or fall with the past 20 values of the stock.

apple_twenty_minutes_movements = []
for i in range(0,len(apple_percentage_change)-20):
    apple_twenty_minutes_movements.append(apple_percentage_change[i:i+20])    
apple_rise_fall = apple_rise_fall[20:]
from sklearn.svm import LinearSVC as SVC
model = SVC()
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(apple_twenty_minutes_movements, apple_rise_fall, test_size=0.3, random_state=0)
model.fit(X_train,y_train)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
model.score(X_test,y_test)
0.54498714652956293

An r-squared 0.545 says a lot about using momentum to predict stock prices. This is nearly the same as a coin-tosss, and the random walk theory suggests that this should be expected.

Great care should be taken when using stock-momentum intuition over short intervals. Over the long-run, such a strategy will surely not make much money.