Alpha Testing RevoScaleR Running in Hadoop

September 16, 2013

At Revolution Analytics our mission is to establish R as the driver for enterprise-level computational frameworks. In part, this means that a data scientist ought to be able to develop an R-based application in one context, e.g. her local PC, and then change horses on the fly (so to speak) and run it on a platform with more horsepower, with a minimum of acrobatics. For example, I usually work on my Windows laptop using the IDE included with Revolution R Enterprise for Windows. Most of the time my compute context is set to either RxLocalSeq, which indicates that all commands will execute sequentially on my notebook, or RxLocalParallel, which enables RevoScaleR parallel external memory algorithms (PEMAs) to execute in parallel using both cores on my laptop. Every now and then, however, I get to do something that requires much more computational resource. Revolution R Enterprise 7, which is scheduled for general availability later this year, will support PEMAs running directly on Hadoop. For the alpha testing of this software I was given access to a small, 5-node Hortonworks Hadoop cluster that Revolution Engineering set up to run as an Amazon EC2 instance. Data sets (both .csv files and Revolution .xdf files) were imported into HDFS for me, and Revolution R Enterprise was pre-installed on every node in the cluster.

Getting access to the Hadoop cluster could not have been easier. All that I had to do was set up a Cygwin shell configured with OpenSSH, set the proper permissions on the .pem file that was provided to me, and put the file in my Cygwin directory. Now, to fit a model using the Hadoop cluster, all I have to do is run a few lines of R code that invoke my permissions and set my compute context to the Hortonworks cluster. The following script, which I can run from almost any Palo Alto coffee shop, fits a logistic regression model using data on the Hadoop cluster.

# This script shows code for executing RevoScaleR functions in an alpha-level version
# of Revolution R Enterprise (RRE) V7.0 on a Hadoop cluster. The Hadoop cluster is running
# remotely in the Amazon EC2 cloud. The script assumes that an ssh connection has been established
# with a Linux node running the JobTracker and NameNode for the Hadoop cluster.
mySshUsername <- "user-name"                                 # Set user name
mySshHostname <- ""                                          # Public-facing cluster IP address
mySshSwitches <- "-i C:/cygwin/user-name.pem"                # Location of .pem permissions file
myHadoopCluster <- RxHadoopMR(sshUsername = mySshUsername,   # Describe the Hadoop compute context
                              sshHostname = mySshHostname,
                              sshSwitches = mySshSwitches)
myNameNode <- "master.local"                                      # Name of the name node
myPort <- 8020                                                    # Port number of the Hadoop name node
bigDataDirRoot <- "/share"                                        # Location of the provided data
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)  # Create the file system object
mortCsvDataDir <- file.path(bigDataDirRoot, "mortDefault/CSV")    # Specify the path on the Hadoop cluster
mortText <- RxTextData(mortCsvDataDir, fileSystem = hdfsFS)       # Set the data source
rxSetComputeContext(myHadoopCluster)                              # Set the compute context
rxGetComputeContext()                                             # Check that the context has been reset
rxSummary(~., data = mortText)                                    # Summarize the data
# Fit a logistic regression model
logitObj <- rxLogit(default ~ F(year) + creditScore + yearsEmploy + ccDebt,
                    data = mortText, reportProgress = 1)
summary(logitObj)                                                 # Look at the output

The first section of the script, after the initial comments, sets up permissions and specifies the Hadoop compute context. The second section points to the data on the Hadoop cluster in much the same way that one would point to data on a local machine. Then there is a line of code that sets the compute context to the Hadoop cluster. Following that, the rxSummary() function reads and summarizes the data, which is in a .csv file in HDFS, and the rxLogit() function fits a logistic regression model to this data.

What happens when the script runs is basically the following. My local instance of Revolution R Enterprise recognizes the call to use the remote compute context and sets up the connection to the Hadoop cluster using the permissions provided. Executing the rxLogit() function causes an instance of R 3.0.1 and Revolution R Enterprise 7 to fire up on the Hadoop JobTracker node. Behind the scenes, this kicks off a Hadoop Map/Reduce job. Since logistic regression is implemented as an iterative algorithm, a separate Map/Reduce job gets kicked off for each iteration. This cycle repeats until the regression converges or the limit on the number of iterations is reached. This file contains some of the output sent back to my R console from running the script. It shows the progress reported on the Map/Reduce jobs and a few other details that the Hadoop-curious may find interesting.
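To make the "one Map/Reduce job per iteration" idea concrete, here is a minimal sketch in base R. It is not how rxLogit() is implemented; it assumes a textbook iteratively reweighted least squares (IRLS) fit, simulated data, and four in-memory chunks standing in for HDFS blocks. The names mapChunk and reduceParts are hypothetical helpers made up for this illustration.

```r
# Illustration only: chunked IRLS for logistic regression in base R.
# Simulated data and in-memory chunks are assumptions; nothing here touches Hadoop.
set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
dat <- data.frame(y = y, x = x)
chunks <- split(dat, rep(1:4, length.out = n))   # stand-ins for blocks of a big file

# "Map" step (hypothetical helper): each chunk returns its share of X'WX and X'Wz
mapChunk <- function(chunk, beta) {
  X   <- cbind(1, chunk$x)            # design matrix with intercept
  eta <- drop(X %*% beta)             # linear predictor
  mu  <- plogis(eta)                  # fitted probabilities
  w   <- mu * (1 - mu)                # IRLS weights
  z   <- eta + (chunk$y - mu) / w     # working response
  list(xtwx = crossprod(X, X * w), xtwz = crossprod(X, w * z))
}

# "Reduce" step (hypothetical helper): add up the partial sums from two chunks
reduceParts <- function(a, b) {
  list(xtwx = a$xtwx + b$xtwx, xtwz = a$xtwz + b$xtwz)
}

beta <- c(0, 0)
maxIter <- 25
for (iter in seq_len(maxIter)) {
  parts   <- lapply(chunks, mapChunk, beta = beta)   # one pass over all chunks
  total   <- Reduce(reduceParts, parts)              # combine the partial results
  newBeta <- drop(solve(total$xtwx, total$xtwz))     # weighted least squares update
  converged <- max(abs(newBeta - beta)) < 1e-8
  beta <- newBeta
  if (converged) break
}
beta
```

Each trip through the loop needs a full pass over the data with the current coefficients, which is why an iterative fit on Hadoop translates into a sequence of jobs rather than a single one. On data this small the result can be checked against coef(glm(y ~ x, family = binomial, data = dat)).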

Soon, running Map/Reduce jobs on Hadoop-scale data sets will be within the reach of anyone with basic R skills and access to Revolution R Enterprise. (Note that when it is released, Revolution R Enterprise 7 will support both Hortonworks 1.3 and Cloudera’s CDH3 and CDH4.)

For more information on Revolution and Hadoop, have a look at the recording of Revolution developer Mario Inchiosa’s recent webinar, and don’t miss the webinar describing Revolution and Hortonworks integration coming up on 9/24.

by Joseph Rickert