Interpretation of PCA Results

Say we already have the results from the last post. Before digging into the minute-ticker data, let's take a look at how PCA works and why we need it.

PCA stands for Principal Component Analysis. It is one of the standard methods for reducing the number of dimensions in a dataset. It works by finding the directions along which the data varies most and re-expressing the data along those axes; keeping only the first few of them reduces the dimensionality while losing little information. This is easiest to picture as a 3D-to-2D conversion. A wonderful article showing that is here.
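
To make that concrete, here is a minimal sketch (my own illustration with made-up data, not part of the original analysis): synthetic 3D points that barely vary along one axis can be projected down to 2D with almost no loss of variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic 3D points that barely vary along the third axis
points_3d = rng.randn(200, 3) * [5.0, 2.0, 0.1]

pca = PCA(n_components=2)
points_2d = pca.fit_transform(points_3d)

print(points_2d.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # roughly [0.86, 0.14]: almost nothing lost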

Let's look at the code, using scikit-learn.

import pandas as pd
from sklearn.decomposition import PCA

# Fit PCA on the stocks DataFrame from the last post, keeping 3 components
pca = PCA(n_components=3)
pca.fit(stocks_df)

print(pca.explained_variance_ratio_)
# [ 0.63264413  0.1479355   0.06299594]

# Project the original data onto the 3 principal components
pca_df = pd.DataFrame.from_records(pca.transform(stocks_df))
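
At this point pca_df keeps one row per observation in stocks_df but only one column per component. A quick sanity check (my own addition) makes the reduction visible:

print(stocks_df.shape)  # (n_rows, 496) -- one column per stock
print(pca_df.shape)     # (n_rows, 3)   -- one column per principal component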

Interpretation

Now that we have transformed the dataset from 496 stocks down to just 3 Principal Components (PCs), how do we interpret the result in an intuitive manner?

One way is to look at how each transformed component correlates with the original predictors.

For each PC, we compute its correlation with the stocks in the initial DataFrame and keep only those above a certain threshold.

pca_1_corr = stocks_df.corrwith(pca_df[0])
cond = (pca_1_corr > 0.8) | (pca_1_corr < -0.95)
pca_1_corr[cond].sort_values()
HPQ    -0.972750
HPE    -0.972328
TXN    -0.972032
XYL    -0.967591
AMAT   -0.966321
MCHP   -0.965133
ITW    -0.964074
DGX    -0.960560
SPGI   -0.957677
CXO    -0.956798
IP     -0.953548
YHOO   -0.952876
MRK    -0.952153
AJG    -0.952129
XEC    -0.952063
PXD    -0.951325
LRCX   -0.951040
EOG    -0.951011
KR      0.815645
ENDP    0.817135
SRCL    0.834309
EQR     0.875035
PRGO    0.880227
FSLR    0.900458
dtype: float64

This shows that the first PC captures a broad swathe of the data: I had to choose very high thresholds to limit the results shown here, but it goes without saying that our first PC does a great job of capturing the variance in the data. It also shows which stocks vary together. We will do this for the next two PCs as well; a small helper like the sketch below makes the screen repeatable.
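
Here is that helper. It is my own sketch, assuming the stocks_df and pca_df objects defined above; the function name is made up.

# Hypothetical helper (not from the original post) to repeat the screen
def top_correlated(pc_index, pos_threshold, neg_threshold):
    """Stocks whose correlation with the given PC passes either threshold."""
    corr = stocks_df.corrwith(pca_df[pc_index])
    cond = (corr > pos_threshold) | (corr < neg_threshold)
    return corr[cond].sort_values()

top_correlated(0, 0.8, -0.95)  # reproduces the listing above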

pca_2_corr = stocks_df.corrwith(pca_df[1])
cond = (pca_2_corr > 0.8) | (pca_2_corr < -0.8)
pca_2_corr[cond].sort_values()
UAL    -0.839301
AAL    -0.830186
DAL    -0.802248
SPG     0.807069
HCN     0.807453
MCK     0.814307
CAG     0.814368
NLSN    0.827102
KIM     0.831171
FRT     0.841026
ICE     0.870310
YUM     0.870510
MNST    0.909921
dtype: float64

pca_3_corr = stocks_df.corrwith(pca_df[2])
cond = (pca_3_corr > 0.7) | (pca_3_corr < -0.5)
pca_3_corr[cond].sort_values()
ILMN    -0.533749
NWSA    -0.512770
SYMC    -0.511617
NWS     -0.508223
HRL     -0.505995
REGN    -0.503926
BIIB    -0.501072
GOOGL   -0.500829
NRG      0.729854
DG       0.731487
FOXA     0.734103
COH      0.739231
FOX      0.759672
PEG      0.783842
CHRW     0.790536
DO       0.792061
FAST     0.804535
TROW     0.808688
CPB      0.843875
dtype: float64

This gives us a great deal of dimensionality reduction while keeping most of the information intact.

pca.explained_variance_ratio_.sum()
0.843575569520291

This shows that our 3 PCs capture 84.36% of the variance of the initial dataset of 496 dimensions.
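
As a closing aside (my own sketch, not from the original analysis): rather than fixing n_components at 3 up front, scikit-learn can pick the smallest number of components that reaches a target fraction of variance, by passing n_components as a float between 0 and 1 together with svd_solver='full'.

from sklearn.decomposition import PCA

# Sketch: keep however many components are needed to explain 85% of the variance
pca_85 = PCA(n_components=0.85, svd_solver='full')
pca_85.fit(stocks_df)
print(pca_85.n_components_)  # number of PCs required to cross 85%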