Alpha Testing RevoScaleR Running in Hadoop

DavidMSmith

At Revolution Analytics our mission is to establish R as the driver for enterprise-level computational frameworks. In part, this means that a data scientist ought to be able to develop an R-based application in one context, e.g., her local PC, and then get it moving by changing horses on the fly (so to speak) and have it run on a platform with more horsepower with minimum acrobatics. For example, I usually work on my Windows laptop using the IDE included with Revolution R Enterprise for Windows. Most of the time my compute context is set to either rxLocalSeq, which indicates that all commands will execute sequentially on my notebook, or rxLocalParallel, which enables RevoScaleR parallel external memory algorithms (PEMAs) to execute in parallel using both cores on my laptop. Every now and then, however, I get to do something that requires much more computational resource. For the alpha testing of the Revolution R Enterprise 7 software, which is scheduled for general availability later this year and which will support PEMAs running directly on Hadoop, I was given access to a small, 5-node Hortonworks Hadoop cluster that Revolution Engineering set up to run as an Amazon EC2 instance. Data sets (both .csv files and Revolution .xdf files) were imported into the HDFS file system for me, and Revolution R Enterprise was pre-installed on every node in the cluster.

Getting access to the Hadoop cluster could not have been easier. All that I had to do was set up a Cygwin shell configured with OpenSSH, set the proper permissions on the .pem file that was provided to me, and put the file in my Cygwin directory. Now, to fit a model using the Hadoop cluster, all I have to do is run a few lines of R code that invoke my permissions and set my compute context for the Hortonworks cluster. The following script, which I can run from almost any Palo Alto coffee shop, fits a logistic regression model using data on the Hadoop cluster.

#-----------------------------------------------------------------------------------------------
# RUNNING REVOLUTION R ENTERPRISE 7.0 REVOSCALER FUNCTIONS ON A HADOOP CLUSTER
# This script shows code for executing RevoScaleR functions in an alpha-level version
# of Revolution R Enterprise (RRE) V7.0 on a Hadoop cluster. The Hadoop cluster is running
# remotely in an Amazon EC2 cloud. The script assumes that an ssh connection has been established
# with a Linux node running the JobTracker and NameNode for the Hadoop cluster
#-----------------------------------------------------------------------------------------------
# SET UP PERMISSIONS FOR ACCESSING THE HADOOP CLUSTER
mySshUsername <- "user-name"                   # Set user name
mySshHostname <- "xx.xxx.xxx.xxx"              # Public facing cluster IP address
mySshSwitches <- "-i C:/cygwin/user-name.pem"  # Location of .pem permissions file

myHadoopCluster <- RxHadoopMR(sshUsername = mySshUsername,  # Describe the Hadoop compute context
                              sshHostname = mySshHostname,
                              sshSwitches = mySshSwitches)

myNameNode     <- "master.local"  # Name of name node
myPort         <- 8020            # Port number of Hadoop name node
bigDataDirRoot <- "/share"        # Location of the provided data
#------------------------------------------------------------------------------------------------
# POINT TO THE DATA ON THE HADOOP CLUSTER
hdfsFS         <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)  # Create file system object
mortCsvDataDir <- file.path(bigDataDirRoot, "mortDefault/CSV")            # Specify path on Hadoop cluster
mortText       <- RxTextData(mortCsvDataDir, fileSystem = hdfsFS)         # Set the data source
#-------------------------------------------------------------------------------------------------
# CHANGE THE COMPUTE CONTEXT TO POINT TO THE HADOOP CLUSTER
rxSetComputeContext(myHadoopCluster)  # Set the compute context
rxGetComputeContext()                 # Check that the context has been reset
#------------------------------------------------------------------------------------------------
# DATA ANALYSIS
rxSummary(~., data = mortText)  # Summarize the data

# Fit a logistic regression model
logitObj <- rxLogit(default ~ F(year) + creditScore + yearsEmploy + ccDebt,
                    data = mortText, reportProgress = 1)
summary(logitObj)  # Look at the output
#-------------------------------------------------------------------------------------------------

The first section of the script after the initial comments sets up permissions and specifies the Hadoop compute context. The second section points to the data on the Hadoop cluster in much the same way that one would point to data on a local machine. Then there are two lines of code that switch the compute context to the Hadoop cluster and confirm the change. Following that, we have the code to execute an rxSummary() function, which reads and summarizes the data stored as .csv files in the HDFS file system, and an rxLogit() function that fits a logistic regression model to this data.

What happens when the script runs is basically the following. My local instance of Revolution R Enterprise recognizes the call to use the remote compute context and sets up the connection to the Hadoop cluster using the permissions provided. Executing the rxLogit() function causes an instance of R 3.0.1 and Revolution R Enterprise 7 to fire up on the Hadoop JobTracker node. Behind the scenes, this kicks off a Hadoop Map/Reduce job. Since logistic regression is implemented as an iterative algorithm, a separate Map/Reduce job gets kicked off for each iteration. This cycle repeats until the regression converges or the limit for the number of iterations is reached. This file contains some of the output sent back to my R console from running the script. It shows the progress reported on the Map/Reduce jobs and a few other details that the Hadoop-curious may find interesting.
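To make the "one Map/Reduce job per iteration" idea concrete, here is a minimal base R sketch, not RevoScaleR's actual implementation. It assumes simulated data split into four in-memory chunks standing in for HDFS blocks, and the helper names (map_chunk, reduce_parts, fit_logit) are hypothetical. Each Newton-Raphson iteration maps over the chunks to compute partial gradients and Hessians, reduces them by summation, and then updates the coefficients.

```r
# Simulated stand-in for data distributed across cluster nodes (an assumption;
# this is not the mortgage default data used in the article).
set.seed(42)
n <- 4000
x <- cbind(1, rnorm(n), rnorm(n))              # intercept plus two predictors
true_beta <- c(-1, 0.8, -0.5)
y <- rbinom(n, 1, as.vector(plogis(x %*% true_beta)))

# Split the rows into four chunks, playing the role of HDFS blocks.
chunk_id <- rep(1:4, length.out = n)
chunks <- lapply(1:4, function(k) {
  list(x = x[chunk_id == k, , drop = FALSE], y = y[chunk_id == k])
})

# "Map" step: one chunk's contribution to the gradient and Hessian
# of the logistic log-likelihood at the current coefficients.
map_chunk <- function(chunk, beta) {
  p <- as.vector(plogis(chunk$x %*% beta))     # fitted probabilities
  w <- p * (1 - p)                             # per-row weights
  list(grad = crossprod(chunk$x, chunk$y - p), # X'(y - p)
       hess = crossprod(chunk$x, chunk$x * w)) # X'WX
}

# "Reduce" step: sum the partial results from all chunks.
reduce_parts <- function(parts) {
  list(grad = Reduce(`+`, lapply(parts, `[[`, "grad")),
       hess = Reduce(`+`, lapply(parts, `[[`, "hess")))
}

# Driver: one map/reduce pass per iteration until the update is tiny
# or the iteration limit is reached.
fit_logit <- function(chunks, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(chunks[[1]]$x))
  for (iter in seq_len(max_iter)) {
    total <- reduce_parts(lapply(chunks, map_chunk, beta = beta))
    step <- as.vector(solve(total$hess, total$grad))
    beta <- beta + step
    if (max(abs(step)) < tol) break
  }
  list(coefficients = beta, iterations = iter,
       converged = max(abs(step)) < tol)
}

fit <- fit_logit(chunks)

# Single-machine reference fit on the full data for comparison.
reference <- glm.fit(x, y, family = binomial())
print(cbind(chunked = fit$coefficients,
            glm = as.vector(reference$coefficients)))
```

Only the small gradient and Hessian summaries travel between the chunks and the driver, which is why this style of algorithm suits data that is too large to move to one machine.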

Soon, running Map/Reduce jobs on Hadoop-scale data sets will be within the reach of anyone with basic R skills and access to Revolution R Enterprise. (Note that when it is released, Revolution R Enterprise 7 will support both Hortonworks 1.3 and Cloudera’s CDH3 and CDH4.)

For more information on Revolution and Hadoop, have a look at the recording of Revolution developer Mario Inchiosa’s recent webinar, and don’t miss the webinar describing Revolution and Hortonworks integration coming up on 9/24.

by Joseph Rickert
