Terabytes of trees

DavidMSmith
Last updated: 2009/08/13 at 4:20 PM
I saw a very interesting talk hosted by the SF Bay ACM last night. Google engineer Josh Herbach talked about the platform he’d implemented to build boosted and bagged trees on very large data sets using MapReduce. (A longer version of the talk will be presented at VLDB 2009 later this month.) The data is distributed amongst many machines in GFS (the Google File System): Google AdWords data, with information on each user of Google Search and each click they have made, can run to terabytes, and building a predictive tree on it can take three days.

The algorithm is quite elegant. After an initialization phase that identifies candidate cut-points for continuous predictors and the values of categorical variables, the Map step selects the node to which a new chunk of data is added, and then calculates a deviance score for a number of candidate splits. The Reduce step then selects the best split from the candidates evaluated across the distributed nodes. The process repeats to create a single tree or (as is actually used in practice) a number of bagged and/or boosted trees. One interesting wrinkle: for implementation reasons, the bagged trees use sampling without replacement rather than with replacement (as bagging is usually defined). Given the amount of data, though, I’m not sure this makes any practical difference. Interestingly, he did compare the results to heavily sampling the data and building the tree in-memory in R (all of his charts were done in R, too). He was quite adamant that using all of the data is “worth it” compared to sampling, and with Google’s business model of monetizing the long tail, I can believe it.
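To make the Map and Reduce steps described above a bit more concrete, here is a minimal single-machine sketch in Python. It is my own illustration, not the code from the talk, and it rests on several assumptions: one continuous predictor, one tree node, a numeric response, and squared-error deviance. I have also assumed that each mapper emits summary statistics (count, sum, sum of squares) per candidate cut-point and that the reducer merges them before scoring the splits. The names `map_chunk` and `reduce_best_split` are hypothetical.

```python
from collections import defaultdict

TOTAL = "total"  # hypothetical key for the statistics of the whole node


def map_chunk(chunk, thresholds):
    """Map step (sketch): summarize one chunk of (x, y) rows.

    For each candidate cut-point, accumulate [n, sum_y, sum_y2] over the
    rows with x <= cut-point, plus the same statistics for all rows.
    """
    stats = {key: [0, 0.0, 0.0] for key in [TOTAL, *thresholds]}
    for x, y in chunk:
        for key, s in stats.items():
            if key == TOTAL or x <= key:
                s[0] += 1
                s[1] += y
                s[2] += y * y
    return stats


def sse(n, sum_y, sum_y2):
    """Squared-error deviance of a group, from its summary statistics."""
    return sum_y2 - sum_y * sum_y / n if n else 0.0


def reduce_best_split(mapper_outputs):
    """Reduce step (sketch): merge the statistics, return (cut-point, deviance)."""
    merged = defaultdict(lambda: [0, 0.0, 0.0])
    for out in mapper_outputs:
        for key, (n, s, s2) in out.items():
            m = merged[key]
            m[0] += n
            m[1] += s
            m[2] += s2
    tn, ts, ts2 = merged[TOTAL]
    best = None
    for key, (n, s, s2) in merged.items():
        if key == TOTAL or n == 0 or n == tn:
            continue  # skip the totals and splits that leave one side empty
        deviance = sse(n, s, s2) + sse(tn - n, ts - s, ts2 - s2)
        if best is None or deviance < best[1]:
            best = (key, deviance)
    return best


# Two chunks, as if they lived on two different machines.
chunks = [[(1.0, 1.0), (2.0, 1.0)], [(3.0, 5.0), (4.0, 5.0)]]
candidate_cuts = [1.5, 2.5, 3.5]
best = reduce_best_split(map_chunk(c, candidate_cuts) for c in chunks)
print(best)  # (2.5, 0.0)
```

The reason for passing summary statistics rather than finished scores in this sketch is that counts and sums add up exactly across chunks, so the merged result is the same as if all the data had been scored in one place.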

Josh mentioned that all of the techniques he’d implemented could also be implemented using Hadoop, the open-source MapReduce implementation. This got me thinking that some interesting out-of-memory techniques could be implemented in R via Rhipe, using R statistics functions to implement the Map operations, and R data aggregation for the Reduce functions. Hmm, I feel a new project coming on…

SF Bay ACM: PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
