Terabytes of trees

DavidMSmith

I saw a very interesting talk hosted by the SF Bay ACM last night. Google engineer Josh Herbach talked about the platform he’d implemented to build boosted and bagged trees on very large data sets using MapReduce. (A longer version of the talk will be presented at VLDB 2009 later this month.) The data is distributed amongst many machines in GFS (Google File System). Google AdWords data, with information on each user of Google Search and each click they have made, can run to terabytes, and building a predictive tree on it can take three days.

The algorithm is quite elegant. An initialization phase identifies candidate cut-points for continuous predictors and the values of categorical variables. The Map step then selects the tree node that each new chunk of data belongs to and calculates a deviance score for a number of candidate splits. The Reduce step selects the best split from the various candidates evaluated on the distributed nodes. The process repeats to create a single tree or (as is actually used in practice) a number of bagged and/or boosted trees. One interesting wrinkle: for implementation reasons, the bagged trees use sampling without replacement rather than with replacement (as bagging is usually defined). Given the amount of data, I’m not sure this makes any practical difference, though. Interestingly, he did compare the results to heavily sampling the data and building the tree in-memory in R (all of his charts were done in R, too). He was quite adamant that using all of the data is “worth it” compared to sampling, and with Google’s business model of monetizing the long tail, I can believe it.
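
To make the Map and Reduce roles concrete, here is a minimal single-machine sketch in Python. It is not the implementation described in the talk. It assumes a numeric target, uses the sum of squared errors as the deviance, handles one continuous predictor at a single node, and has each mapper emit partial counts and sums that the reducer merges before scoring. The function names `map_chunk` and `reduce_best_split` are made up for illustration.

```python
def sse(n, s, s2):
    """Sum of squared errors around the mean, from (count, sum, sum of squares)."""
    return s2 - s * s / n if n else 0.0


def map_chunk(chunk, cuts):
    """Map step (sketch): partial statistics for one chunk of (x, y) rows.

    Returns the chunk totals and, for each candidate cut-point, the
    statistics of the rows that fall on the left side (x <= cut).
    """
    total = [0, 0.0, 0.0]
    left = {c: [0, 0.0, 0.0] for c in cuts}
    for x, y in chunk:
        total[0] += 1
        total[1] += y
        total[2] += y * y
        for c in cuts:
            if x <= c:
                stats = left[c]
                stats[0] += 1
                stats[1] += y
                stats[2] += y * y
    return total, left


def reduce_best_split(mapped, cuts):
    """Reduce step (sketch): merge partial statistics, return (cut, deviance)."""
    total = [0, 0.0, 0.0]
    left = {c: [0, 0.0, 0.0] for c in cuts}
    for chunk_total, chunk_left in mapped:
        for i in range(3):
            total[i] += chunk_total[i]
            for c in cuts:
                left[c][i] += chunk_left[c][i]
    best = None
    for c in cuts:
        n_l, s_l, s2_l = left[c]
        # Right-side statistics follow by subtraction from the totals.
        n_r, s_r, s2_r = total[0] - n_l, total[1] - s_l, total[2] - s2_l
        if n_l == 0 or n_r == 0:
            continue  # a split must put rows on both sides
        score = sse(n_l, s_l, s2_l) + sse(n_r, s_r, s2_r)
        if best is None or score < best[1]:
            best = (c, score)
    return best


# Two "machines", each holding one chunk of toy data.
chunks = [[(1.0, 1.0), (2.0, 1.0)], [(3.0, 5.0), (4.0, 5.0)]]
cuts = [1.5, 2.5, 3.5]
mapped = [map_chunk(chunk, cuts) for chunk in chunks]
print(reduce_best_split(mapped, cuts))  # prints (2.5, 0.0)
```

Because the emitted statistics are plain counts and sums, they can be added together in any order, which is what lets each chunk be processed independently before the best split is chosen.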

Josh mentioned that all of the techniques he’d implemented could also be implemented using Hadoop, the open-source MapReduce implementation. This got me thinking that some interesting out-of-memory techniques could be implemented in R via Rhipe, using R statistics functions for the Map operations and R data aggregation for the Reduce operations. Hmm, I feel a new project coming on…

SF Bay ACM: PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce