The Data Lake Debate: Pro Delivers First Rebuttal




In keeping with the spirit of this Lincoln-Douglas debate format, it looks like I only have 4 minutes (or approximately 600 words) to rebut the anti-data lake arguments Anne presented in this post and this one. Let’s do it!

Timer: START!

One of the challenges in this debate – at least for me – is that Anne and I seem to be operating on different definitions of two key terms in this discussion: data lake and Hadoop. The reason I bring this up is because you see this same confusion, or lack of clarity, elsewhere. So that’s where I’d like to start.

Revisiting Definitions (Again!)

About the data lake. In my opening argument, I defined the data lake as a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. I also mentioned that a data lake can take on different shapes and sizes, and provided these examples:

  • A single data lake; or
  • A data lake with multiple data ponds—similar in concept to a data warehouse/data mart model; or
  • Multiple, decentralized data lakes; or
  • A virtual data lake to reduce data movement.

Whereas I’ve been operating under a more logical definition of a data lake during this debate, Anne’s been more focused on a single, physical storage repository in her arguments.

About Hadoop. Hadoop has two primary meanings: it’s both an open source project and an ecosystem of related projects and technologies. Here’s how they differ:

  • Open source project. When Hadoop made its commercial debut, much of the discussion was around Apache Hadoop, an open source project released by the Apache Software Foundation. Apache Hadoop was built to do two things: store and process any and all kinds of data.
  • Ecosystem. Today, when you hear discussions of Hadoop, it’s more likely about the ecosystem of projects – both open source and proprietary – that work with Apache Hadoop to make it a more robust data-everything platform. Apache Hadoop was never intended to do it all. The Hadoop ecosystem, however, is hell-bent on doing it all – and then some.

During this debate, when I’ve mentioned Hadoop, I’ve been referring to the Hadoop ecosystem. From what I can tell from Anne’s arguments, she’s been talking about Apache Hadoop. Again, same word, different uses.

And the Alternative is…

Throughout Anne’s argument, she points out the shortcomings of using Apache Hadoop (not the ecosystem) as a data lake. Point taken. But when I asked what organizations are supposed to do when the majority of their data (80-90%) is not sitting in pristine data structures, Anne replied, “It is not the storage and access [of Apache Hadoop] that brings the advantage. The advantage is in the insights derived from the analysis of the data.” What’s still not clear is how and where this analysis is taking place. If a Hadoop-based data lake is not the answer, then what is? 

Without Purpose is Okay

You can see Anne squirming – like hearing fingernails on a chalkboard – anytime someone mentions collecting and storing data without a purpose or business context. She retaliates with “There’s no value to the organization!” Au contraire, mon amie! Tell Amazon that. They haven’t thrown any data away since day 1. Do you think they knew they’d be getting a patent for anticipatory shipping – i.e., shipping your package before you buy it – when they first started out over 20 years ago?

Today, we have big data technologies, like the Hadoop ecosystem, that allow organizations to collect and store any and all data at a fraction of the cost. I fully agree with Anne that “just because you can doesn’t mean you should” – but I would also contend that just because you can’t define the purpose now doesn’t mean you shouldn’t collect and store it. Don’t be afraid to embrace the unknown unknowns in your data.

Timer: STOP! Total word count: 598

Previously in the Data Lake Debate: