Packages for By-Group Processing in R

February 24, 2011
55 Views

Analyst and BI expert Steve Miller takes a look at the facilities in R for doing “by-group” processing of data. The task consisted of:

… read several text files, merge the results, reshape the intermediate data, calculate some new variables, take care of missing values, attend to meta data, execute a few predictive models and graph the results.

Analyst and BI expert Steve Miller takes a look at the facilities in R for doing “by-group” processing of data. The task consisted of:

… read several text files, merge the results, reshape the intermediate data, calculate some new variables, take care of missing values, attend to meta data, execute a few predictive models and graph the results.

Then repeat the models and graphs for groups or sub-populations marked by distinct values of one or more dimension variables of interest.

The latter step is commonly referred to as “by-group processing.” SAS programmers will recognize by group processing with syntax that invokes a procedure on a sorted data set that looks something like:

proc reg data = dblahblah; by vblahblah;

Check out Steve’s post for how he addressed this in R using the high-performance data.table package by Matthew Dowle (and as Steve suggests, a good place to get started is the example vignettes). 

I’d also add a recommendation for the plyr package which also offers tools to split up data sets by various criteria, and then do by-processing. Here, the plyr: divide and conquer guide is a good place to start. As an added bonus, you can also divide and conquer the computations by exploiting multiple nodes in parallel by engaging a parallel backend for the foreach function. (Note for Windows users: the doSMP backend from Revolution R is also available now on R-Forge and will be on CRAN soon, too.)

Information Management: By-Group Processing, the R data.table and the Power of Open Source