Alpha Testing RevoScaleR Running in Hadoop

DavidMSmith
Last updated: 2013/09/16 at 3:08 PM

At Revolution Analytics our mission is to establish R as the driver for enterprise-level computational frameworks. In part, this means that a data scientist ought to be able to develop an R-based application in one context, e.g. her local PC, and then change horses on the fly (so to speak) and have it run on a platform with more horsepower with a minimum of acrobatics. For example, I usually work on my Windows laptop using the IDE included with Revolution R Enterprise for Windows. Most of the time my compute context is set either to rxLocalSeq, which indicates that all commands will execute sequentially on my notebook, or to rxLocalParallel, which enables RevoScaleR parallel external memory algorithms (PEMAs) to execute in parallel using both cores on my laptop. Every now and then, however, I get to do something that requires much more computational resource. For the alpha testing of the Revolution R Enterprise 7 software, which is scheduled for general availability later this year and which will support PEMAs running directly on Hadoop, I was given access to a small, 5-node Hortonworks Hadoop cluster that Revolution Engineering set up to run as an Amazon EC2 instance. Data sets (both .csv files and Revolution .xdf files) were imported into the HDFS file system for me, and Revolution R Enterprise was pre-installed on every node in the cluster.
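To make that switch concrete, here is a minimal sketch of toggling between the local compute contexts, using the RxLocalSeq and RxLocalParallel constructors from RevoScaleR; the local file name mortDefault.xdf is a hypothetical stand-in used only for illustration:

# A minimal sketch: the same RevoScaleR call runs sequentially or in parallel
# depending only on the active compute context. "mortDefault.xdf" is a
# hypothetical local file used for illustration.
library(RevoScaleR)

rxSetComputeContext(RxLocalSeq())       # run everything sequentially on this machine
rxSummary(~ creditScore, data = "mortDefault.xdf")

rxSetComputeContext(RxLocalParallel())  # let PEMAs use all available local cores
rxSummary(~ creditScore, data = "mortDefault.xdf")

The analysis call itself does not change; only the compute context does, which is the same idea that carries over to running against the Hadoop cluster below.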

Getting access to the Hadoop cluster could not have been easier. All I had to do was set up a Cygwin shell configured with OpenSSH, set the proper permissions on the .pem file that was provided to me, and put the file in my Cygwin directory. Now, to fit a model using the Hadoop cluster, all I have to do is run a few lines of R code that invoke my permissions and set my compute context to the Hortonworks cluster. The following script, which I can run from almost any Palo Alto coffee shop, fits a logistic regression model using data on the Hadoop cluster.

#-------------------------------------------------------------------------------------------------
# RUNNING REVOLUTION R ENTERPRISE 7.0 REVOSCALER FUNCTIONS ON A HADOOP CLUSTER
# This script shows code for executing RevoScaleR functions in an alpha-level version
# of Revolution R Enterprise (RRE) V7.0 on a Hadoop cluster. The Hadoop cluster is running
# remotely in an Amazon EC2 cloud. The script assumes that an ssh connection has been established
# with a Linux node running the JobTracker and NameNode for the Hadoop cluster.
#-------------------------------------------------------------------------------------------------
# SET UP PERMISSIONS FOR ACCESSING THE HADOOP CLUSTER
mySshUsername <- "user-name"                   # Set user name
mySshHostname <- "xx.xxx.xxx.xxx"              # Public-facing cluster IP address
mySshSwitches <- "-i C:/cygwin/user-name.pem"  # Location of .pem permissions file

myHadoopCluster <- RxHadoopMR(sshUsername = mySshUsername,  # Describe the Hadoop compute context
                              sshHostname = mySshHostname,
                              sshSwitches = mySshSwitches)

myNameNode     <- "master.local"  # Name of the Hadoop name node
myPort         <- 8020            # Port number of the Hadoop name node
bigDataDirRoot <- "/share"        # Location of the provided data
#-------------------------------------------------------------------------------------------------
# POINT TO THE DATA ON THE HADOOP CLUSTER
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)  # Create a file system object
mortCsvDataDir <- file.path(bigDataDirRoot, "mortDefault/CSV")    # Specify the path on the Hadoop cluster
mortText <- RxTextData(mortCsvDataDir, fileSystem = hdfsFS)       # Set the data source
#-------------------------------------------------------------------------------------------------
# CHANGE THE COMPUTE CONTEXT TO POINT TO THE HADOOP CLUSTER
rxSetComputeContext(myHadoopCluster)  # Set the compute context
rxGetComputeContext()                 # Check that the context has been reset
#-------------------------------------------------------------------------------------------------
# DATA ANALYSIS
rxSummary(~., data = mortText)  # Summarize the data

# Fit a logistic regression model
logitObj <- rxLogit(default ~ F(year) + creditScore + yearsEmploy + ccDebt,
                    data = mortText, reportProgress = 1)
summary(logitObj)  # Look at the output
#-------------------------------------------------------------------------------------------------


The first section of the script, after the initial comments, sets up permissions and specifies the Hadoop compute context. The second section points to the data on the Hadoop cluster in much the same way that one would point to data on a local machine. Then a single line of code sets the compute context to the Hadoop cluster. Following that, we have the code to execute an rxSummary() function, which reads and summarizes the data stored in a .csv file in the HDFS file system, and an rxLogit() function, which fits a logistic regression model to this data.

What happens when the script runs is basically the following. My local instance of Revolution R Enterprise recognizes the call to use the remote compute context and sets up the connection to the Hadoop cluster using the permissions provided. Executing the rxLogit() function causes an instance of R 3.0.1 and Revolution R Enterprise 7 to fire up on the Hadoop JobTracker node. Behind the scenes, this kicks off a Hadoop Map/Reduce job. Since logistic regression is implemented as an iterative algorithm, a different Map/Reduce job gets kicked off for each iteration. This cycle repeats until the regression converges or the limit on the number of iterations is reached. This file contains some of the output sent back to my R console from running the script. It shows the progress reported on the Map/Reduce jobs and a few other details that the Hadoop-curious may find interesting.
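Because each iteration corresponds to its own Map/Reduce job, the iteration limit effectively bounds how many Hadoop jobs a single model fit can launch. A minimal sketch of that idea, assuming the mortText data source and myHadoopCluster compute context defined in the script above (the maxIterations value of 10 is an arbitrary illustration):

# Cap the number of iterations, and therefore roughly the number of Map/Reduce
# jobs launched for this fit; the values here are illustrative only.
rxSetComputeContext(myHadoopCluster)
logitObj <- rxLogit(default ~ F(year) + creditScore + yearsEmploy + ccDebt,
                    data = mortText,
                    maxIterations = 10,   # stop after at most 10 iterations
                    reportProgress = 1)   # report progress as each iteration runs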

Soon, running Map/Reduce jobs on Hadoop-scale data sets will be within the reach of anyone with basic R skills and access to Revolution R Enterprise. (Note that when it is released, Revolution R Enterprise 7 will support both Hortonworks 1.3 and Cloudera's CDH3 and CDH4.)

For more information on Revolution and Hadoop, have a look at the recording of Revolution developer Mario Inchiosa's recent webinar, and don't miss the webinar describing the Revolution and Hortonworks integration coming up on 9/24.

by Joseph Rickert
