Resampling Data in Hadoop with RHadoop

On Revolution Analytics partner Cloudera’s blog, Uri Laserson has posted an excellent guide to resampling from a large data set in Hadoop. Resampling is an important step in fitting ensemble models (including random forests and other bagging techniques), and Uri provides a step-by-step guide to implementing resampling methods using RHadoop.
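
To make the idea concrete, here is a minimal sketch of one common way to resample a data set too large to hold in memory. This is an illustration in plain Python, not RHadoop code, and not necessarily the method Uri's guide uses. It assumes that each record can be handled independently, as a mapper would handle it, and it approximates sampling with replacement by emitting each record a Poisson-distributed number of times. The function names `poisson` and `poisson_resample` and the `rate` parameter are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def poisson_resample(records, rate=1.0, seed=None):
    """Yield each record a Poisson(rate)-distributed number of times.

    Every record is processed on its own, so the loop body could run
    inside independent map tasks. With rate=1.0 the output has roughly
    as many records as the input, which approximates a bootstrap sample.
    """
    rng = random.Random(seed)
    for record in records:
        for _ in range(poisson(rate, rng)):
            yield record


# Hypothetical usage: resample 1,000 integer "records".
sample = list(poisson_resample(range(1000), seed=42))
```

The size of the resulting sample is random rather than fixed, which is the price paid for never needing to see the whole data set at once. A fitting procedure such as bagging would repeat this with different seeds to obtain one sample per model.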

By the way, if you’re new to RHadoop, here’s RHadoop creator and project leader Antonio Piccolboni introducing RHadoop at last year’s Strata CA conference.

Cloudera blog: How-to: Resample from a Large Data Set in Parallel (with R on Hadoop)