Why Variety Is the Unsolved Problem in Big Data

The term “big data” is thrown around rather loosely today. To apply more structure, Gartner’s IT glossary defines big data in terms of the “3 V’s” – volume, velocity, and variety:

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

Technology advances have helped us enormously in dealing with the first two attributes – volume and velocity. Advances in storage technologies have brought down costs of storing all of that data, and technologies like Apache™ Hadoop® help companies assemble the processing power by distributing computing across inexpensive, redundant components.

But the issue of data variety remains much more difficult to solve programmatically. Instead, we call on experts in big data applications in specific domains. As a result, many big data initiatives remain constrained by the skills of the people available to work on them. And this challenge is keeping the industry from realizing the full potential of big data in diverse fields.

The symptom of the problem: Services spending

If you look at recent history, most technology innovations follow a pattern. As technologies evolve, eventually the differentiation – and money – flows to the software. Marc Andreessen famously outlined this pattern in his “Why Software Is Eating the World” essay in the Wall Street Journal in 2011.

Now look at big data spending today – according to recent numbers from Gartner, spending on services outweighs spending on software by a ratio of nine to one*. Even if you account for the fact that much of the software is open source, that’s still a lot of spending on services. In fact, Gartner projects that services spending will reach more than $40 billion by 2016.

Services spending is symptomatic of a larger problem that cannot easily be solved with software. If it were easily solvable, someone would have figured it out, given the amount of spending going into services today. I think that the problem lies in data variety – the sheer complexity of a multitude of data sources, good and bad data mixed together, multiple formats, multiple units, and so on. As a result of this unsolved problem, we’re grooming a large field of specialists with proficiency in specific domains, such as marketing data, social media data, telco data, etc. And we’re paying those people well, because their skills are both valuable and relatively scarce.
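To make the variety problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the three feeds, their field names, and the unit codes. It shows the same kind of temperature reading arriving in three formats and three units, with one bad value mixed in, and the hand-written adapter each source needs before the numbers can be compared.

```python
import csv
import io
import json

# Hypothetical feeds: the same kind of measurement, reported three different ways.
CSV_FEED = "sensor,temp_f\nA1,98.6\nA2,n/a\n"
JSON_FEED = '[{"id": "B1", "temperature": {"value": 37.0, "unit": "C"}}]'
KV_FEED = "id=C1;temp=310.15K"


def to_celsius(value, unit):
    """Convert a reading to Celsius; the unit codes are assumptions for this sketch."""
    unit = unit.upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit}")


def from_csv(text):
    # This source states its unit only through the column name "temp_f".
    for row in csv.DictReader(io.StringIO(text)):
        try:
            celsius = to_celsius(float(row["temp_f"]), "F")
        except ValueError:
            continue  # bad data mixed in with good: skip unparseable readings
        yield row["sensor"], celsius


def from_json(text):
    # This source carries its unit explicitly alongside the value.
    for item in json.loads(text):
        reading = item["temperature"]
        yield item["id"], to_celsius(reading["value"], reading["unit"])


def from_kv(text):
    # This source appends the unit to the number as a one-letter suffix.
    fields = dict(pair.split("=", 1) for pair in text.split(";"))
    raw = fields["temp"]
    yield fields["id"], to_celsius(float(raw[:-1]), raw[-1])


def normalize():
    """Merge all three feeds into one mapping of sensor id to degrees Celsius."""
    readings = {}
    for source in (from_csv(CSV_FEED), from_json(JSON_FEED), from_kv(KV_FEED)):
        for sensor_id, celsius in source:
            readings[sensor_id] = round(celsius, 2)
    return readings


print(normalize())
```

None of the adapters is hard to write, but each one encodes a piece of knowledge about its source that someone had to supply: which unit a column name implies, what counts as a missing value, how a suffix should be read. Multiply that by hundreds of sources and the services bill starts to make sense.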

Drilling down into the data variety problem

When META Group (now Gartner) analyst Doug Laney first wrote about the big data definition in 2001, he discussed the ‘variety’ part of the big data challenge as referring to data formats, structures and semantics.

More than a decade later, the online world is a much larger, more interconnected and complex place. The sheer variety of data available for analysis has grown dramatically since that definition in 2001. To paraphrase Hamlet, “There are more data types in cyberspace than are dreamt of in your definitions.” And with the coming Internet of Things, the variety of data will continue to grow as the devices collecting and sending data proliferate.

Data variety and context

When it comes to data variety, a large part of the challenge lies in putting the data into the right context. Nothing exists in isolation in today’s networked world, and most of the big data available for analysis is linked to outside entities and organizations. Making sense of that context takes time and human understanding, and that slows everything down.

Today, it falls to people to address the larger problem of variety by making sense of and adding context to the diverse data types and sources (hence the large services spending cited above). These people need both domain expertise, to understand the context of the data, and big data skills, to understand how to use the data.

Until we come up with a scalable and viable way to address the “high-variety” part of the big data challenge, we’ll continue to rely on people and services. This will keep the cost of big data initiatives high and limit their applications in new environments, where the potential for new insights may be high, but the budget simply doesn’t exist to apply big data disciplines.

*Gartner, “Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016,” October 2012
