Interpretation of PCA Results
Say we already have the results from the last post. Before going into the minute-level ticker data, let's take a look at how PCA works and why we need it.
PCA stands for Principal Component Analysis. It is one of the standard methods for reducing the dimensionality of a dataset. It works by finding the directions along which the data varies most and re-aligning the data along those axes, so that a small number of axes captures most of the variance and the rest can be dropped. This is easiest to understand in a 3D-to-2D conversion. A wonderful article showing that is here.
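As a minimal sketch of that 3D-to-2D idea (synthetic data, not the stock dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D data whose variance lives almost entirely in a plane.
rng = np.random.default_rng(0)
plane = rng.normal(size=(200, 2))          # two high-variance directions
noise = 0.05 * rng.normal(size=(200, 1))   # one near-flat direction
X = np.hstack([plane, noise])

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)                  # data re-aligned onto the top-2 axes
print(X2.shape)                            # (200, 2)
print(pca.explained_variance_ratio_.sum()) # close to 1.0: little information lost
```

Dropping the third axis loses almost nothing here because nearly all of the variance was in the plane to begin with; that is exactly the bet PCA makes.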
Let’s look at the code from scikit learn.
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(stocks_df)
PCA(copy=True, n_components=3, whiten=False)
pca.explained_variance_ratio_
[ 0.63264413  0.1479355   0.06299594]
pca_df = pd.DataFrame.from_records(pca.transform(stocks_df))
Now that we have a dataset transformed from the 496 stocks to just 3 Principal Components (PCs), how do we interpret this in an intuitive manner?
One way is to look at how each transformed component varies with the original predictors.
For each PC, we compute its correlation with the stocks in the initial DataFrame, and keep only those above a certain threshold.
pca_1_corr = stocks_df.corrwith(pca_df[0])
cond = (pca_1_corr > 0.8) | (pca_1_corr < -0.95)
pca_1_corr[cond].sort_values()
HPQ    -0.972750
HPE    -0.972328
TXN    -0.972032
XYL    -0.967591
AMAT   -0.966321
MCHP   -0.965133
ITW    -0.964074
DGX    -0.960560
SPGI   -0.957677
CXO    -0.956798
IP     -0.953548
YHOO   -0.952876
MRK    -0.952153
AJG    -0.952129
XEC    -0.952063
PXD    -0.951325
LRCX   -0.951040
EOG    -0.951011
KR      0.815645
ENDP    0.817135
SRCL    0.834309
EQR     0.875035
PRGO    0.880227
FSLR    0.900458
dtype: float64
This shows us that the first PC captures a lot: I had to choose a very high threshold just to limit the results shown here, so it goes without saying that the first PC does a great job of capturing the variance in the data. It also shows which stocks vary together. We will do the same for the next two PCs.
pca_2_corr = stocks_df.corrwith(pca_df[1])
cond = (pca_2_corr > 0.8) | (pca_2_corr < -0.8)
pca_2_corr[cond].sort_values()
UAL    -0.839301
AAL    -0.830186
DAL    -0.802248
SPG     0.807069
HCN     0.807453
MCK     0.814307
CAG     0.814368
NLSN    0.827102
KIM     0.831171
FRT     0.841026
ICE     0.870310
YUM     0.870510
MNST    0.909921
dtype: float64
pca_3_corr = stocks_df.corrwith(pca_df[2])
cond = (pca_3_corr > 0.7) | (pca_3_corr < -0.5)
pca_3_corr[cond].sort_values()
ILMN    -0.533749
NWSA    -0.512770
SYMC    -0.511617
NWS     -0.508223
HRL     -0.505995
REGN    -0.503926
BIIB    -0.501072
GOOGL   -0.500829
NRG      0.729854
DG       0.731487
FOXA     0.734103
COH      0.739231
FOX      0.759672
PEG      0.783842
CHRW     0.790536
DO       0.792061
FAST     0.804535
TROW     0.808688
CPB      0.843875
dtype: float64
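The per-component steps above can be wrapped in a small helper. This is a sketch: the function name and the tiny demo DataFrame (with made-up tickers AAA–DDD) are my own, but the logic is the same correlate-and-threshold pattern used for each PC.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_correlated(stocks_df, pca_df, component, lo, hi):
    """Stocks whose correlation with one PC is above hi or below lo."""
    corr = stocks_df.corrwith(pca_df[component])
    return corr[(corr > hi) | (corr < lo)].sort_values()

# Hypothetical demo: three tickers share a common trend, one is pure noise.
rng = np.random.default_rng(1)
trend = np.cumsum(rng.normal(size=100))
stocks = pd.DataFrame({
    "AAA": trend + 0.1 * rng.normal(size=100),
    "BBB": trend + 0.1 * rng.normal(size=100),
    "CCC": -trend + 0.1 * rng.normal(size=100),
    "DDD": rng.normal(size=100),
})
pcs = pd.DataFrame.from_records(PCA(n_components=2).fit_transform(stocks))

# AAA, BBB and CCC load heavily on the first PC; DDD does not.
print(top_correlated(stocks, pcs, 0, lo=-0.95, hi=0.95))
```

The first PC picks up the shared trend, so the three trend-driven tickers clear the threshold while the noise ticker drops out.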
This gives us a great deal of dimensionality reduction while keeping most of the information intact.
Our 3 PCs capture 84.36% of the variance of the initial dataset of 496 dimensions.
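That 84.36% figure is simply the sum of the explained-variance ratios printed earlier:

```python
import numpy as np

# The ratios reported by pca.explained_variance_ratio_ above
ratios = np.array([0.63264413, 0.1479355, 0.06299594])
print(round(ratios.sum() * 100, 2))  # 84.36
```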