SmartData Collective

The Data Lake Debate: Pro Delivers First Rebuttal

Tamara Dull

In keeping with the spirit of this Lincoln-Douglas debate format, it looks like I only have 4 minutes (or approximately 600 words) to rebut the anti-data lake arguments Anne presented in this post and this one. Let’s do it!

Timer: START!

One of the challenges in this debate – at least for me – is that Anne and I seem to be operating on different definitions of two key terms in this discussion: data lake and Hadoop. I bring this up because you see this same confusion, or lack of clarity, elsewhere. So that’s where I’d like to start.

Revisiting Definitions (Again!)

About the data lake. In my opening argument, I defined the data lake as a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. I also mentioned that a data lake can take on different shapes and sizes, and provided these examples:

  • A single data lake; or
  • A data lake with multiple data ponds—similar in concept to a data warehouse/data mart model; or
  • Multiple, decentralized data lakes; or
  • A virtual data lake to reduce data movement.

Whereas I’ve been operating under a more logical definition of a data lake during this debate, Anne has been more focused on a single, physical storage repository in her arguments.

About Hadoop. Hadoop has two primary meanings: it’s both an open source project and an ecosystem of related projects and technologies. Here’s how they differ:

  • Open source project. When Hadoop made its commercial debut, much of the discussion was around Apache Hadoop, an open source project released by the Apache Software Foundation. Apache Hadoop was built to do two things: store and process any and all kinds of data.
  • Ecosystem. Today, when you hear discussions of Hadoop, it’s more likely about the ecosystem of projects – both open source and proprietary – that work with Apache Hadoop to make it a more robust data-everything platform. Apache Hadoop was never intended to do it all. The Hadoop ecosystem, however, is hell-bent on doing it all – and then some.

During this debate, when I’ve mentioned Hadoop, I’ve been referring to the Hadoop ecosystem. From what I can tell from Anne’s arguments, she’s been talking about Apache Hadoop. Again, same word, different uses.

And the Alternative is…

Throughout Anne’s argument, she points out the shortcomings of using Apache Hadoop (not the ecosystem) as a data lake. Point taken. But when I asked what organizations are supposed to do when the majority of their data (80-90%) is not sitting in pristine data structures, Anne replied, “It is not the storage and access [of Apache Hadoop] that brings the advantage. The advantage is in the insights derived from the analysis of the data.” What’s still not clear is how and where this analysis is taking place. If a Hadoop-based data lake is not the answer, then what is? 

Without Purpose is Okay

You can see Anne squirming – as if hearing fingernails on a chalkboard – anytime someone mentions collecting and storing data without a purpose or business context. She retaliates with “There’s no value to the organization!” Au contraire, mon amie! Tell Amazon that. They haven’t thrown any data away since day 1. Do you think they knew they’d be getting a patent for anticipatory shipping – i.e., shipping your package before you buy it – when they first started out over 20 years ago?

Today, we have big data technologies, like the Hadoop ecosystem, that allow organizations to collect and store any and all data at a fraction of the cost. I fully agree with Anne that “just because you can doesn’t mean you should” – but I would also contend that just because you can’t define the purpose now doesn’t mean you shouldn’t collect and store it. Don’t be afraid to embrace the unknown unknowns in your data.

Timer: STOP! Total word count: 598


Previously in the Data Lake Debate:

  • The Introduction – by Jill Dyche
  • Pro’s Up First – by Tamara Dull
  • Questioning the Pro – by Anne Buff and Tamara Dull
  • Negative Puts a Stake in the Ground – by Anne Buff
  • Pro Cross-Examines Con – by Tamara Dull and Anne Buff

