The Data Lake Debate: Questioning the Pro

Data Lake Debate

Tamara, Tamara, Tamara… We have known each other for quite a while and I cannot believe we are having the same conversation AGAIN! Technology is not the answer for every data issue. I get it – Hadoop and the concept of data lakes are hot topics. However, just because they are trending in the world of technology does not mean that they will solve critical business issues such as taking full advantage of an organization’s data. That being said, I have a few questions for you about your definitions and your arguments.

Anne Buff: 1. You define data as “information produced or stored by a computer that can be digitally transmitted or processed.” Given that data is not information until meaning is derived from processing within a specific business context or purpose, how can a storage repository which stores data (as you define the data lake) be essential to an organization without purpose?

Tamara Dull: Anne, Anne, Anne, as always, you look smashing in your rose-colored data management glasses! But this ain’t your grandfather’s rodeo anymore and it’s time to consider new technological pastures.

Okay, so you weren’t fond of my use of the term information in my data definition. That’s fair. It’s confusing and somewhat circular. My point was that the data in a data lake is digital in nature. Can we agree on that?

As for your question, do you think I’m suggesting that an organization create a big black box, slap a label of “data lake” on it, and then start filling it up with any and all data—without any context or purpose? As crazy as that sounds (and there are some who are saying this), it is not what I’m suggesting. What I am saying is now that we have the technology to build a proper data lake, it’s time to consider it—not in a “build it and they will come” haphazard fashion, but in a strategic, methodical manner.

Will all the data that comes into the data lake have context and purpose? Absolutely not. Even though that’s the ideal, it’s not realistic. Context and purpose will need to be added as the data is processed and pushed/pulled downstream to other repositories and applications.

Anne Buff: 2. Just because you can capture and store “any and all data” in a data lake as you state in your first argument, it doesn’t mean you should. Governance is not inherent to big data environments. Data is neutral. What you do with it is not. If collection and discovery are not governed, enormous risk is created for the organization. How do you resolve this?

Tamara Dull: Yes, I totally agree: Just because you can doesn’t mean you should. If we look back over the years, we’ve learned to live with: Just because you want to (store and process any and all data) doesn’t mean you can (due to technology limitations, costs, etc.).

Now that we can—with big data technologies like Hadoop—the question is now shifting to “Should we?” Some are saying, “Sure! Grab it all and throw it in the data lake!” while others are convinced that grabbing it all will only result in a big ol’ smelly data swamp. The correct answer lies somewhere in between these two extremes for an organization.

But make no mistake: The data lake is not a geographical cure. If your organization is already doing a crummy job of governing and managing the data in your current systems, then moving any data, existing or new, to a data lake is not going to solve this core shortcoming. Your bad data and data practices will follow you.

Anne Buff: 3. In your game-changing value proposition you contend, “With today’s big data technologies, organizations now have an economically attractive option to bring any and all data into a single, scalable infrastructure model.” While that sounds ideal, co-located data is not integrated data, which is necessary for reporting and analytics. At what point do you consider actually integrating the data?

Tamara Dull: The short answer is schema-on-read. What this means is: Instead of structuring the data before it goes into a repository (which is called schema-on-write, and it’s what we’re currently doing in our relational systems), the data flows freely from lots of different sources into the data lake in its raw, native form.

The data lake inquirer can now apply her own lens to the data as she sees fit—as she’s “reading” and integrating the data from this complex, ever-evolving data lake. Why is this important? First, this allows the inquirer to be extremely agile and go with the flow, if you will. And second, she can start getting value from her data “now”—instead of waiting for it to go through the more traditional schema-on-write process. 
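To make schema-on-read a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the sample records in `RAW_LAKE`, the field names, and the helpers `read_with_schema` and `spend_by_customer` are hypothetical and are not tied to Hadoop or any particular product. The point is simply that the records land in their raw form and the inquirer supplies her own lens (the schema) at the moment she reads them.

```python
import json

# Hypothetical raw events, stored exactly as each source emitted them.
# Note that the two sources do not agree on field names or types.
RAW_LAKE = [
    '{"source": "web", "user": "u1", "amount": "19.99", "ts": "2015-01-05"}',
    '{"source": "pos", "customer_id": "u2", "total": 5.5}',
    '{"source": "web", "user": "u1", "amount": "4.01", "ts": "2015-01-06"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a reader-defined schema while reading (schema-on-read).

    `schema` maps an output field name to a pair:
    (candidate source keys to look for, converter to apply to the value).
    """
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (candidates, convert) in schema.items():
            # Take the first candidate key that this raw record actually has.
            value = next((record[k] for k in candidates if k in record), None)
            row[field] = convert(value) if value is not None else None
        yield row


# One inquirer's lens: "customer" and "spend", whatever the source called them.
SPEND_SCHEMA = {
    "customer": (["user", "customer_id"], str),
    "spend": (["amount", "total"], float),
}


def spend_by_customer(raw_lines):
    """Integrate the raw records into spend totals at read time."""
    totals = {}
    for row in read_with_schema(raw_lines, SPEND_SCHEMA):
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + (row["spend"] or 0.0)
    return totals


if __name__ == "__main__":
    print(spend_by_customer(RAW_LAKE))
```

A different inquirer could pass a different schema over the same raw records (say, one that only cares about `ts`) without anyone reloading or restructuring the lake. The trade-off, of course, is that the integration work has not gone away; it has moved to the reader.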

Anne Buff: 4. In your argument regarding more questions and better answers, you state “A business user can now ask the data lake any question based on the known data in the lake.” Given that the technical skills to access and analyze data from a Hadoop or other big data environment are significantly specialized and not abundant, how do you suggest a business user ask the data lake any question based on the known data in the lake?

Tamara Dull: Yes, I will maintain that any business user should be able to ask the data lake any question, but I don’t believe that any business user should have direct access to the data lake. As I was discussing earlier, a data lake gives its inquirers a lot of flexibility and agility; however, you only want to give that sort of freedom to those who are trained, equipped and empowered to deal with complex, evolving data repositories: yes, those with the technical chops, like data scientists and engineers.

As for the rest of the business users: Give ‘em an app! Just kidding—sort of. Since the data lake opens the door to “more questions and better answers,” provide better solutions for business users—whether it be employees, customers or partners—to ask these questions, and maybe even explore some of the answers themselves (where it’s safe to swim). Some of your best questions (and answers) may be resting with this crowd.

Previously in the Data Lake Debate: