Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    image fx (67)
    Improving LinkedIn Ad Strategies with Data Analytics
    9 Min Read
    big data and remote work
    Data Helps Speech-Language Pathologists Deliver Better Results
    6 Min Read
    data driven insights
    How Data-Driven Insights Are Addressing Gaps in Patient Communication and Equity
    8 Min Read
    pexels pavel danilyuk 8112119
    Data Analytics Is Revolutionizing Medical Credentialing
    8 Min Read
    data and seo
    Maximize SEO Success with Powerful Data Analytics Insights
    8 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: Alpha Testing RevoScaleR Running in Hadoop
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Software > Hadoop > Alpha Testing RevoScaleR Running in Hadoop
AnalyticsHadoopR Programming Language

Alpha Testing RevoScaleR Running in Hadoop

DavidMSmith
DavidMSmith
8 Min Read
Image
SHARE

ImageAt Revolution Analytics our mission is to establish R as the driver for Enterprise level computational frameworks. In part, this means that a data scientist ought to be able to develop an R based application in one context, e.g.

ImageAt Revolution Analytics our mission is to establish R as the driver for Enterprise level computational frameworks. In part, this means that a data scientist ought to be able to develop an R based application in one context, e.g. her local PC, and then get it moving by changing horses on the fly (so to speak) and have it run on a platform with more horsepower with minimum acrobatics. For example, I usually work on my Windows laptop using the IDE included with Revolution R Enterprise for Windows. Most of the time my compute context is set to either rxLocalSeq which indicates that all commands will execute sequentially on my notebook, or rxLocalParallel which enables RevoScaleR parallel external memory algorithms (PEMAs) to execute in parallel using both cores on my laptop. Every now and then, however, I get to do something that requires much more computational resource. For the alpha testing of the Revolution R Enterprise 7 software which is scheduled for general availability later this year and which will support PEMAs running directly on Hadoop I was given access to a small, 5 node Hortonworks Hadoop cluster that Revolution Engineering set up to run as an Amazon EC2 instance. Data sets — both .csv files and Revolution .xdf files — were imported into the HDFS file system for me, and Revolution R Enterprise was pre-installed on every node in the cluster.

Getting access to the Hadoop cluster could not have been easier. All that I had to do was set up a Cygwin shell configured with OpenSSH and then set up the proper permissions in the .pem file that was provided to me and put the file in my Cygwin directory. Now, to fit a model using the Hadoop cluster all I have to do is run a few lines of R code that invoke my permissions and set my compute context for the Hortonworks cluster. The following script which I can run from almost any Palo Alto coffee shop fits a logistic regression model using data on the Hadoop cluster.

#----------------------------------------------------------------------------------------------- # RUNNING REVOLUTION R ENTERPRISE 7.0 REVOSCALER FUNCTIONS ON A HADOOP CLUSTER # This script shows code for executing RevoScaleR functions in an alpha-level version # of Revolution R Enterprise (RRE) V7.0 on a Hadoop Cluster. The Hadoop cluster is running  # remotely in an Amazon Ec2 cloud. The script assumes that an ssh connection has been established  # with a Linux node running the JobTracker and NameNode for the Hadoop cluster #----------------------------------------------------------------------------------------------- # SET UP PERMISSIONS FOR ACCESSING THE HADOOP CLUSTER mySshUsername = 'user-name'                    # Set user name mySshHostname <- "xx.xxx.xxx.xxx"            # Public facing cluster IP address mySshSwitches <- "-i C:/cygwin/user-name.pem" # Location of .pem permissions file   myHadoopCluster (sshUsername = mySshUsername, # Describe the Hadoop compute context sshHostname = mySshHostname, sshSwitches = mySshSwitches)   myNameNode <- "master.local" # name of name node myPort <- 8020 # Port number of Hadoop name node bigDataDirRoot <- "/share" # Location of the provided data #------------------------------------------------------------------------------------------------ # POINT TO THE DATA ON THE HADOOP CLUSTER  hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort) # Create file system object mortCsvDataDir <- file.path(bigDataDirRoot, "mortDefault/CSV") # Specify path on Hadoop cluster hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort) # Generate a file system object mortText <- RxTextData( mortCsvDataDir, fileSystem = hdfsFS ) # Set the data source #------------------------------------------------------------------------------------------------- # CHANGE THE COMPUTE CONTEXT TO POINT TO THE HADOOP CLUSTER rxSetComputeContext(myHadoopCluster) # Set the compute context rxGetComputeContext() # Check that the context has been reset #------------------------------------------------------------------------------------------------ # DATA ANALYSIS rxSummary(~., data = mortText) # Summarize the data # Fit a logistic regression model logitObj (default ~ F(year) + creditScore + yearsEmploy + ccDebt, data = mortText, reportProgress = 1) summary(logitObj) # look at the output #-------------------------------------------------------------------------------------------------

Created by Pretty R at inside-R.org

More Read

Social Media, Corporate Decisions and Analytics
“I’m convinced that after years stuck with only…
Social Analytics: New Uses of Social Intelligence
The Data Driven World: Is it possible to Leverage on Analytics to Recognise Market Trends?
Sticking vs Backsliding – It Is So Sad

The first section of the script after the initial comments sets up permissions and specifies the Hadoop compute context. The second section points to the data on the Hadoop cluster in much the same way that one would point to data on a local machine. Then there is a line of code that points to the Hadoop compute context. Following that, we have the code to execute an rxSummary() function to read and summarize the data which is in a .csv file in the HDFS file system, and an rxLogit() function that fits a logistic regression model to this data.

What happens when the script runs is basically the following. My local instance of Revolution R Enterprise recognizes the call to use the remote compute context and sets up the connection to Hadoop cluster using the permissions provided. Executing the rxLogit() function causes an instance of R 3.0.1 and Revolution R Enterprise 7 to fire up on the Hadoop JobTracker node. Behind the scenes, this kicks off a Hadoop Map/Reduce job. Since logistic regression is a implemented as an iterative algorithm this means that a different Map/Reduce job gets kicked off for each iteration. This cycle repeats until the regression converges or the limit for the number of iterations is reached. This file contains some of the output sent back to my R console from running the script. It shows the progress reported on the Map/Reduce jobs and a few other details that the Hadoop curious may find interesting.

Soon running Map/Reduce jobs on Hadoop scale data sets will be within the reach of anyone with a basic R skills and access to Revolution R Enterprise. (Note that when it is released, Revolution R Enterprise 7 will support both Hortonworks 1.3 and Cloudera’s CDH3 and CDH4.)

For more information on Revolution and Hadoop have a look at the recording of Revolution developer Mario Inchiosa’s recent webinar and don’t miss the webinar describing Revolution and Hortonworks integration coming up on 9/24.

by Joseph Rickert

Share This Article
Facebook Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

image fx (2)
Monitoring Data Without Turning into Big Brother
Big Data Exclusive
image fx (71)
The Power of AI for Personalization in Email
Artificial Intelligence Exclusive Marketing
image fx (67)
Improving LinkedIn Ad Strategies with Data Analytics
Analytics Big Data Exclusive Software
big data and remote work
Data Helps Speech-Language Pathologists Deliver Better Results
Analytics Big Data Exclusive

Stay Connected

1.2kFollowersLike
33.7kFollowersFollow
222FollowersPin

You Might also Like

Data Design Matters

3 Min Read
Fintech collecting web data
AnalyticsData MiningMarket Research

How Fintech is Using Web Data For Financial Intelligence

8 Min Read

Tackling Human Intelligence

3 Min Read

What Every CEO Needs to Know About Key Supporting Technologies

10 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

ai in ecommerce
Artificial Intelligence for eCommerce: A Closer Look
Artificial Intelligence
AI and chatbots
Chatbots and SEO: How Can Chatbots Improve Your SEO Ranking?
Artificial Intelligence Chatbots Exclusive

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?