The Data Lake Debate: Pro Cross-Examines Con


As was to be expected, Anne, your arguments against building a data lake are both persuasive and passionate. You’ve made some great points, my friend, but you’re making this way too easy for me. Before I jump into my rebuttal [my next post], I’d like to clarify a few things you brought up. I’ve boiled it down to three questions. What say you?

1. In your arguments, you focus on data volumes and the ancillary costs of open source software (OSS) to support these large volumes. Yet, more recent studies show that organizations aren’t as concerned about their data volumes—not everyone is a Google or Facebook—as they are about the variety of data and the ability to integrate it all. How do you address these concerns?

I cannot stress enough that data brought into a data lake is co-located, not integrated. Even with schema on read, the integration happens outside of the storage environment, on the banks of this beautiful data lake. Every query that requires a new structure or schema for the data will need to be written from scratch. For most organizations, the value returned on the time and talent required for this extensive coding, against a still-novel technology, is limited if not nonexistent. The skills needed to access and integrate data from Hadoop make available talent scarce. You are right: not everyone is a Google or a Facebook. Most organizations do not have these skills on staff, nor do they have the budget to bring them on.
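To make that coding burden concrete, here is a minimal schema-on-read sketch in PySpark (the path, field names and schema are hypothetical). Notice that the structure lives in each consumer's query code, not in the lake itself:

```python
# A minimal sketch of schema-on-read with PySpark, assuming a hypothetical
# HDFS path and file layout. The raw files carry no schema; every consumer
# must declare the structure it expects at read time.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# The structure is defined in the query code, not in the storage layer.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Raw JSON dropped into the lake; nothing was integrated on write.
events = (spark.read
          .schema(clickstream_schema)
          .json("hdfs:///lake/raw/clickstream/"))  # hypothetical path

# Any new question with a different shape means new schema and new code.
events.groupBy("event_type").count().show()
```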

Hadoop does provide a fantastic data storage opportunity, but it does not require us to abandon all of our existing structured data environments. Copying existing structured data into a data lake (especially transactional data) would duplicate effort and storage and would create additional risk for the organization. Moving operational data would be an enormous undertaking, as it would require applications throughout the organization to undergo a significant coding and design overhaul, which is not going to be a popular idea in any business unit.

The ideal scenario is to leave existing data where it lives today and use Hadoop as the storage repository for the data that previously could not be stored because of constraints of volume, variety or velocity. Organizations can take advantage of data virtualization tools, which not only eliminate the integration coding challenge but also provide advantages such as centralized security and governance. The data is queried, transformed and structured as needed and provisioned to business users through virtual views. No dumping of data, just purposeful access, integration and use.
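For illustration, here is a rough PySpark sketch of that virtual-view pattern (connection details, table names and paths are all hypothetical). A dedicated data virtualization platform would layer the centralized security and governance on top of the same idea:

```python
# A rough illustration of "leave data where it lives": a Spark SQL view joins
# an operational database (read in place over JDBC) with raw files in Hadoop.
# This stands in for a data virtualization tool; all names here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtual-view-sketch").getOrCreate()

# Existing structured data stays in the operational database.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://ops-db:5432/sales")  # hypothetical
             .option("dbtable", "customers")
             .option("user", "readonly")
             .option("password", "changeme")
             .load())
customers.createOrReplaceTempView("customers")

# New, high-volume data lands in Hadoop.
web_events = spark.read.parquet("hdfs:///lake/web_events/")  # hypothetical path
web_events.createOrReplaceTempView("web_events")

# Business users query a purposeful, integrated view; nothing was copied or dumped.
spark.sql("""
    SELECT c.segment, COUNT(*) AS visits
    FROM web_events e
    JOIN customers c ON c.customer_id = e.customer_id
    GROUP BY c.segment
""").show()
```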

2. Related to the first question, you state: “Before organizations start down the path of discovering capabilities within a data lake, they should first turn to taking full advantage of their current data.” What if most of their current data is semi-structured or unstructured (often cited at 80-90%)? How do they take full advantage of that data?

Who’s the one making this easy? Careful throwing those stones, Ms. Dull. Your glass house is exquisite.

Historically, in business, unstructured data sources were managed within the scope of knowledge management or content management. The vast storage capability that Hadoop presents allows documents, emails and other unstructured sources to be centrally stored, and their content can now be treated as accessible data. While it is true that these sources can now be accessed through Hadoop to glean their content as ingestible data, it is not the storage and access that brings the advantage. The advantage is in the insights derived from analyzing the data. Regardless of the type of data (structured, semi-structured or unstructured) or how and where the data is stored, organizations take full advantage of any and all data by generating value when processing or analyzing it within a specific business context.

3. You seem to suggest a top-down data management approach to big data; for example, “…the real success factor is found in strong data management capabilities under the umbrella of a mature data governance program.” Are you implying a top-down approach to big data? When does a bottom-up approach make sense?

There is a time and a place for both data science and data governance; they do not need to be mutually exclusive. The rigor of data governance is not there to create obstacles but to create an environment that fosters data management autonomy at the lowest level within the framework of the enterprise data governance program. When it comes to data discovery, governance still has value in protecting the organization from compliance and security risks, not because of the data itself but because of how the data is used. I emphatically support innovation labs and data science programs; they are ideal examples of bottom-up approaches. However, just because they play in the sandbox doesn’t mean they don’t follow playground rules.

Thanks, Anne! I’ll get started on my first rebuttal to what you’ve presented. Stay tuned!
