Leveraging Metadata for (Really) Big Data

The word “metadata” means different things to different people. Many think of it as the embodiment of Big Brother grabbing information about everything we do and say. More fundamentally, metadata is data that describes other data. In essence, it allows quicker insight into, or easier interpretation of, the data than one might get from analyzing all of it at an atomic level. Some assume that key elements of the underlying data (like your name, who you called, or where you live) are pulled out of the overall data to create specific metadata. In that context, it is more akin to indexing an underlying database for quicker analysis of the information you know you are going to want.
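To make the indexing analogy concrete, here is a minimal Python sketch. The call records, field names, and the build_index helper are all made up for illustration; the point is only that a small structure derived from a key element (here, the caller) answers a known question without scanning every record.

```python
from collections import defaultdict

# Hypothetical call records; the field names are assumptions for this sketch.
calls = [
    {"caller": "alice", "callee": "bob", "seconds": 120},
    {"caller": "carol", "callee": "alice", "seconds": 45},
    {"caller": "alice", "callee": "dave", "seconds": 300},
]


def build_index(records, field):
    """Map each distinct value of `field` to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


# Built once, up front, for a question we know we will ask repeatedly.
caller_index = build_index(calls, "caller")

# Answered from the index alone, without touching the other records.
alice_calls = [calls[i] for i in caller_index["alice"]]
```

The trade-off is the one the paragraph describes: the index only helps with the questions you anticipated when you built it.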

Looking back, when reporting and analytics against traditional relational databases became problematic due to the sheer volume of data they contained, we saw the rise of column-based analytic databases, which became the de facto approach. Many of these architectures were designed as general-purpose data warehouses, where the ability to scale horizontally to query larger data volumes serves the needs of the enterprise data warehouse very well.

But as machine-to-machine sensors, monitors, meters, and similar devices continue to fuel the Internet of Things, the enormous volumes of data are testing the capabilities of traditional database technologies and straining infrastructures that did not anticipate the dramatic increase in incoming data, the way the data would need to be queried, or the changing ways users would want to analyze it. This is why a different path is needed: an approach that is unique in that metadata is built on ingestion instead of creating indices or projections.

This has its pros and cons. The limitations of the underlying structure of the data become more important because of how the mathematics associated with the metadata actually works. In a nutshell, the more the data looks and feels like machine data, the better, so this approach is not ideally suited to serve as a general-purpose data warehouse. On the upside, it offers very fast load speeds, very tight compression, and exceptional maneuverability over the data to support high-performance ad hoc queries and investigative analytics, and it does not require a database administrator to manage indices or tune the database. This is all because of the metadata.
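One way to picture metadata built at load time is a per-block summary that lets a query skip data it cannot need. The Python sketch below is an assumption about how such a scheme could look, not a description of any particular product: the Block class, the tiny block size, and the count_above query are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

BLOCK_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow


@dataclass
class Block:
    values: List[int]  # the atomic data
    lo: int            # metadata: smallest value in the block
    hi: int            # metadata: largest value in the block


def ingest(readings: List[int], block_size: int = BLOCK_SIZE) -> List[Block]:
    """Split readings into blocks and record min/max for each as it is loaded."""
    blocks = []
    for start in range(0, len(readings), block_size):
        chunk = readings[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def count_above(blocks: List[Block], threshold: int) -> Tuple[int, int]:
    """Count readings above threshold; return (matches, blocks actually scanned)."""
    matches = 0
    scanned = 0
    for block in blocks:
        if block.hi <= threshold:
            continue  # metadata proves nothing in this block can match
        if block.lo > threshold:
            matches += len(block.values)  # metadata proves every value matches
            continue
        scanned += 1  # only ambiguous blocks require reading atomic data
        matches += sum(1 for v in block.values if v > threshold)
    return matches, scanned


# Made-up sensor readings for illustration.
readings = [1, 2, 3, 4, 10, 11, 12, 13, 5, 9, 6, 8, 20, 21, 22, 23]
blocks = ingest(readings)
matches, scanned = count_above(blocks, 7)
```

In this sample only one of the four blocks has to be read. The sketch also hints at why machine-like data suits the approach: readings that arrive in narrow, orderly ranges produce tight min/max bounds, so more blocks can be resolved from metadata alone.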

The really big (and really costly) database machines, starting with Teradata and extending to IBM/Netezza, Oracle Exadata, and SAP/HANA, still have practical limitations in terms of dataset volumes, as do columnar stores like HP/Vertica, SAP/Sybase IQ, Redshift, and others. These were the enabling technologies when the wall at tens and sometimes hundreds of terabytes became an impediment. But as we all know, data scale salvation was to be found in Hadoop, Cassandra, and other NoSQL variants. Or was it? There is no doubt that many use cases that would previously have been limited can now realistically work using these technologies.

But more often, we are seeing instances where the “long-running query” is a problem, even in the new world of Hadoop and now Spark. This is especially true when multiple tables must be joined for complex queries. SQL-on-Hadoop solutions like Impala, Hive, and Drill provide some relief, but they are hardly nirvana. If you need insight into the correlations that exist among multiple multi-petabyte tables, you might be waiting a while. And as if that were not enough, virtually all projections suggest the volumes of data we must accommodate are now growing exponentially, primarily due to the rise of the Internet of Things.

There is an old saying that “Necessity is the mother of invention.” In this case, the necessity arises from the massive amount of data, where insight into that information is a strategic advantage, if not a basic requirement. The time and certainly the cost associated with gaining that insight are becoming increasingly impractical.

This brings us back to metadata. We will start to hear vendors talking about metadata more frequently, which makes sense given where we are headed. As the market progresses, metadata is likely to be increasingly used to address the gaps created by the time and money required to deal with atomic data. We can also expect a variety of approaches using metadata, given the number of technology suppliers who care so deeply about this space. That’s a good thing. We will all be better off as the market as a whole evolves to meet our ever-changing needs.