The Data Lake Debate: Pro is Up First
To data lake or not to data lake? That is the question du jour, precipitated by the big data tsunami that hit our enterprise shores a few years ago. Unfortunately, the answer to this question is not so cut-and-dried, as we can see by this small sampling of headlines:
- Gartner says beware of the data lake fallacy [Gartner]
- Gartner gets the ‘data lake’ concept all wrong [InfoWorld]
- The data lake model is a powerhouse for invention [O’Reilly Radar]
- Will data lakes turn into data swamps or data reservoirs? [Tamr]
- Careful: Don’t drown in your data lake! [SmartData Collective]
Some (vendors) would have you believe that the data lake is a panacea for today’s big data challenges; it is not. Some seem to think that taking the data lake route is the easy (and lazy) way out; it is not. And some think it’s a dream. And maybe it is.
In the next seven blog posts (including this one), Anne Buff and I will be discussing the pros and cons of the data lake. The resolution before us is:
The data lake is essential for any organization that wants to take full advantage of its data.
I will be writing in support of this statement, and Anne will be presenting the negative argument. The format of this discussion is loosely structured on the Lincoln-Douglas debate format, so it will include opening arguments, cross-examinations, rebuttals, and summaries.
Let’s get started.
Before I jump into the arguments in support of the data lake, let me define a few key terms from the resolution: The data lake is essential for any organization that wants to take full advantage of its data.
- Data is information produced or stored by a computer that can be digitally transmitted or processed.
- A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.
- An organization is an organized body of people with a particular purpose, such as a company, corporation, institution, non-profit group, or agency.
- To take full advantage of something means to make good use of it for benefit or gain.
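The "structure is not defined until the data is needed" part of the data lake definition is often called schema-on-read. Here is a minimal, hypothetical sketch in plain Python (not any particular product's API): raw records land in the lake exactly as they arrive, and a consumer imposes a structure only at read time.

```python
import json
from pathlib import Path

# A toy "data lake": raw records are written exactly as they arrive,
# with no upfront schema (schema-on-write is deliberately skipped).
lake = Path("lake")
lake.mkdir(exist_ok=True)

raw_events = [
    '{"user": "ann", "action": "click", "page": "/home"}',
    '{"user": "bob", "action": "purchase", "amount": 42.50}',
    'sensor-7|2015-06-01T12:00:00|temp=21.3',  # non-JSON line kept as-is
]
(lake / "events.raw").write_text("\n".join(raw_events))

# Later, a consumer defines the structure it needs at read time
# (schema-on-read): parse what it can, leave the rest for other consumers.
def read_json_events(path):
    for line in path.read_text().splitlines():
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            pass  # not this consumer's concern; the raw line is preserved

purchases = [e for e in read_json_events(lake / "events.raw")
             if e.get("action") == "purchase"]
print(purchases)  # [{'user': 'bob', 'action': 'purchase', 'amount': 42.5}]
```

Note that the sensor reading is never forced into a purchase schema or thrown away; a different consumer can parse it later with a structure of its own choosing.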
Why a Data Lake is Essential
Argument 1. Any and all data can be captured and stored in a data lake.
With today’s big data technologies, such as Hadoop, organizations can store, process, and analyze all of their data, not just a portion of it, at a fraction of the cost and time of traditional relational technologies. That’s because our tried-and-true technologies of the last three decades simply weren’t designed to handle the 3V’s of big data. Let’s quickly review:
- Volume. We’ve all seen the stats: 2.5 exabytes of data are created daily; 90% of all data in the world today was produced in the last two years; and it’s estimated that 40 zettabytes of data will exist by 2020. Even if you can wrap your head around these “big” data volumes, measured in exabytes and zettabytes, our current relational technologies cannot. Big data technologies like Hadoop can.
- Variety refers to the different forms and sources of data. Our relational technologies thrive with structured data, which has been estimated to be only 20% of all data generated. What about the other 80%: the data we call semi-structured and unstructured, such as photos, videos, email, GPS, and sensor data? Our relational technologies address only the 20%; technologies like Hadoop can handle it all.
- Velocity is about the speed at which data is being generated. Both humans and machines are generating data at alarming rates these days. And yes, even though our relational technologies are able to store, process, and analyze much of this streaming data, the cost is high, sometimes too high for organizations that are resource-strapped and time-constrained.
With technologies like Hadoop, organizations can capture any data—big and small, internal and external—and store it in a data lake. Depending on corporate strategies and business requirements, a “data lake” can take on different shapes and sizes. For example, it may look like:
- A single data lake; or
- A data lake with multiple data ponds—similar in concept to a data warehouse/data mart model; or
- Multiple, decentralized data lakes; or
- A virtual data lake to reduce data movement.
Here’s the value proposition: With today’s big data technologies, organizations now have an economically attractive option to bring any and all data into a single, scalable infrastructure model. This is a game changer.
Argument 2. A data lake allows for more questions and better answers.
While newer technological capabilities and cost savings are enticing, what’s even more compelling is what business users can do with the data once it’s in a data lake. Think of it as discovering the unknown unknowns. I’ll explain.
Organizations have been capturing data for years, long before big data. Typically, a fraction of this data gets scrubbed, transformed, aggregated, and moved into structured data warehouses, data marts, analytical sandboxes, and the like. Business users then use their reporting and analytical tools to ask this subset of data predefined questions (shaped by what data was kept and how it was structured), and the data answers. This is today’s tried-and-true process.
Here’s how the story changes with a data lake: An organization captures whatever data it wants in its raw form in the data lake. A business user can now ask the data lake any question based on the known data in the lake. Granted, the user may or may not know which questions she wants to ask going in, and that’s okay because it’s all part of the exploration process, i.e., discovering the unknown unknowns. The ability to ask more questions will ideally lead to better and more insightful answers.
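To make the "unknown unknowns" point concrete, here is a small illustrative sketch in plain Python (the events and fields are invented for the example, not taken from any real system). Because the raw events were kept, a question nobody anticipated at load time can still be answered later.

```python
from collections import Counter

# Raw click events as they landed in the lake; nothing was aggregated
# away, so every field is still available for questions invented later.
events = [
    {"user": "ann", "page": "/home",    "referrer": "email"},
    {"user": "bob", "page": "/pricing", "referrer": "search"},
    {"user": "ann", "page": "/pricing", "referrer": "search"},
    {"user": "eve", "page": "/home",    "referrer": "ad"},
]

# A warehouse built only to answer "page views per user" would likely
# have dropped the referrer field. With the raw data retained, a
# brand-new question still works: which referrer drives the most
# visits to the pricing page?
by_referrer = Counter(e["referrer"] for e in events if e["page"] == "/pricing")
print(by_referrer.most_common(1))  # [('search', 2)]
```

The contrast with the warehouse model is the point: there, the question had to be anticipated before the ETL pipeline was built; here, the question can be invented after the fact.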
Today’s technology landscape is changing fast. Organizations of all shapes and sizes are being pressured to be data-driven and to do more with less. Even though big data technologies are still in a relatively nascent stage, the impact of the 3V’s of big data cannot be ignored. The time is now for organizations to begin planning for and building out their Hadoop-based data lakes.