
Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Daniel Tunkelang

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The third speaker in the program was Cloudera co-founder and Chief Scientist Jeff Hammerbacher. Jeff, recently hailed by Tim O’Reilly as one of the world’s most powerful data scientists, built the Facebook Data Team, which is best known for open-source contributions that include Hive and Cassandra. Jeff’s talk was entitled “Experiences Evolving a New Analytical Platform: What Works and What’s Missing”. I am thankful to Jeff Dalton for live-blogging a summary.

Jeff’s talk was a whirlwind tour through the philosophy and technology for delivering large-scale analytics (aka “big data”) to the world:

1) Philosophy

The true challenges of data mining are assembling a data set with relevant, accurate information and determining the appropriate analysis techniques. While in the past it made sense to plan data storage and structure around the intended use of the data, the economics of storage and the availability of open-source analytics platforms argue for the reverse: data first, ask questions later; store first, establish structure later. The goal is to enable everyone — developers, analysts, business users — to “party on the data”, providing infrastructure that keeps them from clobbering one another or starving each other of resources.
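
To make the “store first, establish structure later” idea concrete, here is a minimal sketch in Python. The file name and field names are hypothetical, chosen only for illustration: raw events are appended with no upfront schema, and structure is imposed only at read time, when a specific question is asked.

```python
import json

# "Store first": append raw events as-is, with no upfront schema decisions.
# The file path and field names are hypothetical, for illustration only.
raw_events = [
    {"user": "alice", "action": "view", "item": "sku-123", "ts": 1320000000},
    {"user": "bob", "action": "purchase", "item": "sku-456", "ts": 1320000042},
]
with open("events.jsonl", "a") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# "Ask questions later": impose only the structure this question needs, at read time.
purchases_per_user = {}
with open("events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        if event.get("action") == "purchase":
            user = event.get("user", "unknown")
            purchases_per_user[user] = purchases_per_user.get(user, 0) + 1

print(purchases_per_user)  # e.g. {'bob': 1}
```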

2) Defining the Platform

No one just uses a relational database anymore. For example, consider Microsoft SQL Server. It is actually part of a unified suite that includes SharePoint for collaboration, PowerPivot for OLAP, StreamInsight for complex event processing (CEP), etc. As with the LAMP stack, there is a coherent framework for analytical data management, which we can call an analytical data platform.

3) Cloudera’s Platform

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data-intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), Pig Latin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.
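
As a small illustration of the MapReduce processing framework in that stack, here is a word-count mapper and reducer written in the Hadoop Streaming style. The file name and the local shell pipeline in the docstring are stand-ins for an actual cluster submission, not Cloudera-specific tooling; this is a sketch of the programming model, not a production job.

```python
#!/usr/bin/env python
"""Minimal word-count mapper and reducer in the Hadoop Streaming style.

Run locally as a stand-in for a cluster job, e.g.:
    cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
"""
import sys


def mapper():
    # Emit one tab-separated (word, 1) pair per token, as Streaming expects.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so counts for a given word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")


if __name__ == "__main__":
    {"map": mapper, "reduce": reducer}[sys.argv[1]]()
```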

4) What’s Next?

For the substrate, we can expect support for fat servers with fat pipes, operating system support for isolation, and improved local filesystems (e.g., btrfs). Storage improvements will give us a unified file format, compression, better performance and availability, richer metadata, distributed snapshots, replication across data centers, native client access, and separation of namespace and block management. We will see stabilization of our existing compute tools and better variety, as well as improved fault tolerance, isolation and workload management, low-latency job scheduling, and a unified execution backend for workflow. And we will see better integration through REST API access to all platform components, better document ingest, maintenance of source catalog and provenance information, and integration with analytics tools beyond ODBC. We will also see tools that facilitate the transition from unstructured to structured data (e.g., RecordBreaker).
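
To show the kind of unstructured-to-structured transition a tool like RecordBreaker automates, here is a minimal hand-written Python sketch. The log format, field names, and regular expression are hypothetical, chosen only to illustrate turning raw text lines into structured records; RecordBreaker’s value is inferring this kind of structure automatically rather than requiring it to be written by hand.

```python
import re

# Hypothetical log format, for illustration only.
LOG_PATTERN = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+) \[(?P<timestamp>[^\]]+)\] "
    r'"(?P<method>\w+) (?P<path>\S+)" (?P<status>\d{3})'
)

def parse_line(line):
    """Turn one raw log line into a structured record, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

raw = '203.0.113.7 [12/Nov/2011:10:01:33 +0000] "GET /index.html" 200'
print(parse_line(raw))
# {'ip': '203.0.113.7', 'timestamp': '12/Nov/2011:10:01:33 +0000',
#  'method': 'GET', 'path': '/index.html', 'status': '200'}
```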

Jeff’s talk was as information-dense as this post suggests, and I hope the mostly-academic CIKM audience was not too shell-shocked. It’s fantastic to see practitioners not only building essential tools for research in information and knowledge management, but also reaching out to the research community to build bridges. I saw lots of intense conversation after his talk, and I hope the results realize the two-fold mission of the Industry Event: to give researchers an opportunity to learn about the problems most relevant to industry practitioners, and to offer practitioners an opportunity to deepen their understanding of the field in which they are working.
