SmartData Collective

The Data Lake Debate: Pro Delivers First Rebuttal

Tamara Dull

Contents
  • Revisiting Definitions (Again!)
  • And the Alternative is…
  • Without Purpose is Okay

In keeping with the spirit of this Lincoln-Douglas debate format, it looks like I only have 4 minutes (or approximately 600 words) to rebut the anti-data lake arguments Anne presented in this post and this one. Let’s do it!

Timer: START!

One of the challenges in this debate – at least for me – is that Anne and I seem to be operating on different definitions of two key terms in this discussion: data lake and Hadoop. I bring this up because you see this same confusion, or lack of clarity, elsewhere. So that’s where I’d like to start.

Revisiting Definitions (Again!)

About the data lake. In my opening argument, I defined the data lake as a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. I also mentioned that a data lake can take on different shapes and sizes, and provided these examples:

  • A single data lake; or
  • A data lake with multiple data ponds—similar in concept to a data warehouse/data mart model; or
  • Multiple, decentralized data lakes; or
  • A virtual data lake to reduce data movement.

Whereas I’ve been operating under a more logical definition of a data lake during this debate, Anne’s been more focused on a single, physical storage repository in her arguments.

About Hadoop. Hadoop has two primary meanings: it’s both an open source project and an ecosystem of related projects and technologies. Here’s how they differ:

  • Open source project. When Hadoop made its commercial debut, much of the discussion was around Apache Hadoop, an open source project released by the Apache Software Foundation. Apache Hadoop was built to do two things: store and process any and all kinds of data.
  • Ecosystem. Today, when you hear discussions of Hadoop, it’s more likely about the ecosystem of projects – both open source and proprietary – that work with Apache Hadoop to make it a more robust data-everything platform. Apache Hadoop was never intended to do it all. The Hadoop ecosystem, however, is hell-bent on doing it all – and then some.

During this debate, when I’ve mentioned Hadoop, I’ve been referring to the Hadoop ecosystem. From what I can tell from Anne’s arguments, she’s been talking about Apache Hadoop. Again, same word, different uses.

And the Alternative is…

Throughout Anne’s argument, she points out the shortcomings of using Apache Hadoop (not the ecosystem) as a data lake. Point taken. But when I asked what organizations are supposed to do when the majority of their data (80-90%) is not sitting in pristine data structures, Anne replied, “It is not the storage and access [of Apache Hadoop] that brings the advantage. The advantage is in the insights derived from the analysis of the data.” What’s still not clear is how and where this analysis is taking place. If a Hadoop-based data lake is not the answer, then what is? 

Without Purpose is Okay

You can see Anne squirming – as if hearing fingernails on a chalkboard – anytime someone mentions collecting and storing data without a purpose or business context. She retaliates with “There’s no value to the organization!” Au contraire, mon amie! Tell Amazon that. They haven’t thrown any data away since day 1. Do you think they knew they’d be getting a patent for anticipatory shipping – i.e., shipping your package before you buy it – when they first started out over 20 years ago?

Today, we have big data technologies, like the Hadoop ecosystem, that allow organizations to collect and store any and all data at a fraction of the cost. I fully agree with Anne that “just because you can doesn’t mean you should” – but I would also contend that just because you can’t define the purpose now doesn’t mean you shouldn’t collect and store it. Don’t be afraid to embrace the unknown unknowns in your data.

Timer: STOP! Total word count: 598


Previously in the Data Lake Debate:

  • The Introduction – by Jill Dyche
  • Pro’s Up First – by Tamara Dull
  • Questioning the Pro – by Anne Buff and Tamara Dull
  • Negative Puts a Stake in the Ground – by Anne Buff
  • Pro Cross-Examines Con – by Tamara Dull and Anne Buff

