The Data Lake Debate: The Final Word from Negative

April 23, 2015

By Anne Buff

Well, it seems you took the gloves off this time, Tamara. I appreciate the valiant effort and your passionate belief in the Hadoop ecosystem. However, given your revisit to the definition of the data lake and your clarifications about Hadoop, I find it important to repeat the resolution we are debating: “a data lake is essential for any organization to take full advantage of its data.” We are not debating whether a data ecosystem is essential – just the data lake. While I will stand strong with you that a well-designed data ecosystem (open source or proprietary) of many interdependent systems is imperative for businesses to succeed in today’s digital world, there are still ample concerns and cautions to consider before declaring the data lake essential. As I reflect on our debate, the following are the key issues keeping the data lake from the prestige and splendor with which you have presented it.

Physical attributes do not determine business value. Regardless of shape, size, or other linguistic expression used to define the qualities of a data lake, the data lake still remains a storage repository. Until the data is processed and consumed, it does not provide business value. Any storage repository on its own does not prove itself essential for the organization; it must be part of a larger, well-designed data infrastructure. The options for data storage architectures are numerous and the implementation choice should be contingent upon business need and technical requirements. A data lake is not the catchall answer.

The talent gap is real. If we were to accept your argument that the Hadoop ecosystem is what organizations should be considering, the technical skills required to support the environment would be even broader than for Hadoop, the open source project, alone. As I mentioned before, finding individuals with the skills to access, query and manage just Apache Hadoop is difficult. If you add in the need for skills with Hive, Spark, Ambari, Pig, HBase, etc., and the wide variety of vendor distributions, the talent pool shrinks significantly. In the event an organization is able to hire the talent (or grow it in-house), both the cost of retention and the fear of turnover rise dramatically.

The risk is greater than the reward. It does sound idyllic to have any and all of the organization’s data in a central location to serve the needs of the entire enterprise. But at what cost? As I mentioned before, copying existing structured data to a data lake (especially transactional data) would be a duplication of effort and storage and would create additional risk for the organization. How many copies of the data do we need anyway? The source system, the data mart/store, the data warehouse and now the data lake? Data integration is far more important than data co-habitation. Data governance and security are not inherent to the data lake environment (regardless of form). Without policies, procedures and additional technology to secure and protect this massive collection of data, the organization is at enormous risk. No executive in his or her right mind will jump on board for this. There is a reason Capgemini Consulting found that “only 13 percent of organizations have achieved full-scale production for their big data implementations” and “only 27 percent of the executives surveyed described their big data initiatives as successful.” The data lake is no exception.

Collection without purpose is hoarding. Like you said, not everyone is a Google or a Facebook. Well, not everyone is Amazon either. Storing everything is just not an option for most organizations. So the question becomes, “What should be stored?” Answering this question without consideration of strategic business initiatives or goals is futile.

The organizations with which I have worked that have implemented a data lake or a data-lake-like environment for technical initiatives have all had the same concern: “Now that it is built, we need to convince the business to use it.” To establish value and ensure use, the business needs to be involved in the data lake development from the outset. Business stakeholders care about what is stored – not how it is stored. Value will not magically appear without purpose.

All of that being said, there is one scenario where “without purpose” becomes the purpose (I mentioned this before as well). In the world of analytics and data science, the data lake becomes a gold mine. The volume and variety of big data, combined with the accuracy and structure of operational data, provide a rich and fruitful environment for data wizards to develop and refine models that generate insights we never thought possible. Even in this situation, though, I would argue that while the data lake is definitely valuable, the essential component is the brilliant analytical minds.

The alternative is…

You asked, “If a Hadoop-based data lake is not the answer, then what is?” Organizations should absolutely begin to consider new ways of collecting, packaging and delivering data both internally and externally. Ultimately, it doesn’t matter how or where the data is stored but rather how it is integrated and accessed for a purpose. An organization’s data infrastructure and strategy will be an evolution based on business needs and initiatives, budgets, technical skills and available technologies. In time, a data lake may in fact be a valuable asset in an indispensable, well-designed, purpose-built data ecosystem. But by then maybe it will be a data river (ever flowing), or a data mountain (peaks and valleys), or whatever trendy industry term comes to be. Any which way, it will only be a part, never the essential component.

And for the record…

Not all data lakes are Hadoop-based.
