*In this post, Revolution engineer Sherry LaMonica shows us how to use the RevoScaleR big-data package in Revolution R Enterprise to do principal components analysis on 50 years of stock market data — ed.*

Principal components analysis, or PCA, seeks to find a set of orthogonal axes such that the first axis, or first principal component, accounts for as much variability as possible and subsequent axes are chosen to maximize variance while maintaining orthogonality with previous axes. Principal components are typically computed either by a singular value decomposition of the data matrix or an eigenvalue decomposition of a covariance or correlation matrix; the latter permits us to use the RevoScaleR function *rxCovCor* with the standard R function *princomp*.
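To make the eigenvalue route concrete, here is a minimal base R sketch. It uses a small simulated matrix (an assumption for illustration, not the stock data) and checks that calling princomp with only a correlation matrix gives component variances equal to that matrix's eigenvalues.

```r
# Simulated stand-in data: three columns, two of them strongly correlated.
# This is NOT the stock data; it only illustrates the mechanics.
set.seed(1)
base <- rnorm(200)
X <- cbind(a = base + rnorm(200, sd = 0.1),
           b = base + rnorm(200, sd = 0.5),
           c = rnorm(200))

R   <- cor(X)                       # Pearson correlation matrix
eig <- eigen(R, symmetric = TRUE)   # eigenvalues come back in decreasing order
pca <- princomp(covmat = R)         # PCA from the matrix alone, no raw data

# Component variances are the eigenvalues; princomp stores their square roots.
print(all.equal(unname(pca$sdev^2), eig$values))

# The loading vectors are orthonormal, so this is (numerically) the identity.
print(round(crossprod(unclass(pca$loadings)), 8))
```

Because princomp only needs the matrix, the expensive pass over the data can be done by whatever tool computes the correlations, which is where a big-data function fits in.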

Stock market data for open, high, low, close, and adjusted close from 1962 to 2010 are available from InfoChimps. As you might expect, these data are highly correlated, and principal components analysis can be used for data reduction. We read the original data (a set of 26 comma-separated text files, one for each letter of the alphabet) into a single .xdf file, NYSE_daily_prices.xdf:

    nyseDataDir <- "C:/Users/Sherry/Downloads/NYSE"
    dataSourceName <- file.path(nyseDataDir, "NYSE_daily_prices")
    dataFileName <- "NYSE_daily_prices.xdf"

    # The first file creates the .xdf file; every later file is appended as rows.
    append <- "none"
    for (i in LETTERS) {
        # Builds .../NYSE_daily_prices_A.csv, .../NYSE_daily_prices_B.csv, and so on.
        importFile <- paste(dataSourceName, "_", i, ".csv", sep="")
        rxTextToXdf(importFile, dataFileName, stringsAsFactors=TRUE, append=append)
        append <- "rows"
    }

The full data set includes 9.2 million observations of daily open-high-low-close data for some 2800 stocks:

    > rxGetInfoXdf(dataFileName)
    File name: NYSE_daily_prices.xdf
    Number of observations: 9211031
    Number of variables: 9
    Number of blocks: 34

We will use the rxCor function to calculate the Pearson correlation matrix of the five price variables, and pass this matrix to the princomp function, storing the result as stockPca.
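The call itself is not reproduced in the text, so what follows is only a sketch of what this step might look like. The variable names are taken from the loadings table in the output, and the formula-plus-data form of rxCor and the covmat argument of princomp are assumed to be how the two functions are connected. So that the sketch runs even without RevoScaleR or the .xdf file, it falls back to simulated prices (clearly not the NYSE data, so its numbers will differ from the output shown).

```r
# Variable names taken from the loadings table in the output.
priceVars <- c("stock_price_open", "stock_price_high", "stock_price_low",
               "stock_price_close", "stock_price_adj_close")
corFormula <- as.formula(paste("~", paste(priceVars, collapse = " + ")))
dataFileName <- "NYSE_daily_prices.xdf"

if (requireNamespace("RevoScaleR", quietly = TRUE) && file.exists(dataFileName)) {
    # Assumed call: correlation matrix computed from the .xdf file on disk.
    stockCor <- RevoScaleR::rxCor(corFormula, data = dataFileName)
} else {
    # Fallback so the sketch runs anywhere: simulated prices, NOT the NYSE data.
    set.seed(42)
    closePrice <- 50 + cumsum(rnorm(500))
    sim <- data.frame(
        stock_price_open      = closePrice + rnorm(500, sd = 0.3),
        stock_price_high      = closePrice + abs(rnorm(500, sd = 0.3)),
        stock_price_low       = closePrice - abs(rnorm(500, sd = 0.3)),
        stock_price_close     = closePrice,
        stock_price_adj_close = 0.6 * closePrice + rnorm(500, sd = 2)
    )
    stockCor <- cor(sim)
}

# princomp accepts a covariance or correlation matrix through covmat.
stockPca <- princomp(covmat = stockCor)
print(summary(stockPca))
```

One consequence of this route is that princomp never sees the raw observations, so the returned object carries standard deviations and loadings but no per-observation scores.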

This yields the following output:

    > summary(stockPca)
    Importance of components:
                              Comp.1    Comp.2      Comp.3       Comp.4
    Standard deviation     2.0756631 0.8063270 0.197632281 0.0454173922
    Proportion of Variance 0.8616755 0.1300327 0.007811704 0.0004125479
    Cumulative Proportion  0.8616755 0.9917081 0.999519853 0.9999324005
                                 Comp.5
    Standard deviation     1.838470e-02
    Proportion of Variance 6.759946e-05
    Cumulative Proportion  1.000000e+00
    > loadings(stockPca)

    Loadings:
                          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
    stock_price_open      -0.470 -0.166  0.867
    stock_price_high      -0.477 -0.151 -0.276  0.410 -0.711
    stock_price_low       -0.477 -0.153 -0.282  0.417  0.704
    stock_price_close     -0.477 -0.149 -0.305 -0.811
    stock_price_adj_close -0.309  0.951

                   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
    SS loadings       1.0    1.0    1.0    1.0    1.0
    Proportion Var    0.2    0.2    0.2    0.2    0.2
    Cumulative Var    0.2    0.4    0.6    0.8    1.0

The default plot method for objects of class princomp is a screeplot, which is a barplot of the variances of the principal components. We can obtain the plot as usual by calling plot with our principal components object:

    > plot(stockPca)

Between them, the first two principal components explain about 99.2% of the variance (a cumulative proportion of 0.9917); we can therefore replace the five original variables by these two principal components with no appreciable loss of information.
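The 99% figure can be checked by hand from the standard deviations in the summary output: each component's variance is its standard deviation squared, and the proportions are those variances divided by their total. A short sketch, with the numbers copied from the output above and the 0.99 threshold chosen purely for illustration:

```r
# Standard deviations copied from the summary(stockPca) output.
sdev <- c(2.0756631, 0.8063270, 0.197632281, 0.0454173922, 1.838470e-02)

variance <- sdev^2                    # component variances (eigenvalues)
propVar  <- variance / sum(variance)  # proportion of variance per component
cumVar   <- cumsum(propVar)           # cumulative proportion

print(round(cumVar, 4))

# Smallest number of components whose cumulative proportion reaches 0.99
# (the threshold is an illustrative choice, not something the data dictates).
k <- which(cumVar >= 0.99)[1]
print(k)
```

The variances sum to approximately 5, as they should for a correlation matrix of five variables, and two components are enough to pass the threshold.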