Data Variety: What It's All About
Data variety stands out among the three Vs of big data in the report on the big data survey conducted by NewVantage Partners in 2012. One of the survey results shows companies focusing more on data variety than on data volume, both now and over the next three years. The report does not explain why data variety turns out to be such a salient attribute while big data platforms like Hadoop focus more on solving the data volume problem. Of course, data variety also contributes part of the data volume.
Data variety is similar to the diversity of species in the natural world: it demonstrates the richness of information. When exploring data variety, it’s wise to adopt a mindset of listening patiently to different voices about customers’ needs. We must believe customers have put their voices and problems into the data we collect. As end users, some of them may not know what they want, but they express what they want in the form of problems and hide them in the variety and volume of the data. Once you have such a mindset, you will be amazed by what you find.
- Data variety is not only about data sources, types and structures
When talking about data variety, people most often mean multiple or diverse data sources and varying data types, structures and formats: structured, semi-structured or unstructured data such as text, images and videos. Besides those most common types of variety, the contextual information around data, the methods used for creating and gathering data, and the high dimensionality of data should also be considered data variety. These varieties can be counted as the objective, or physical, elements of data variety.
Beyond its objective nature, data variety also has a subjective nature that people usually miss or ignore. What I mean by subjective variety is the interpretation of data, or the insight drawn from it, from different perspectives and by different entities such as people, groups and businesses, together with their corresponding usages or applications. Those factors actually drive the way we analyze, mine, integrate and use data or explain the results, so subjective variety matters as much as objective variety. I also believe subjective variety will drive more objective data varieties.
The Curse and Challenge of Data Variety
First, data variety brings challenges to data processing, analysis, mining and modeling because data is not in a uniform or standard form. For example, a person’s name may appear in several variant forms. Mining any insight from user data requires intensive preprocessing effort, including cleansing, normalization and standardization, handling missing values and correcting errors. Otherwise, the resulting model will lack accuracy or the business will make wrong decisions.
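The name example above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `normalize_name` and the chosen rules (strip accents, reorder "Last, First", drop punctuation, lowercase) are hypothetical, and real cleansing pipelines need far more care.

```python
import unicodedata

def normalize_name(raw):
    """Return a canonical 'first last' form, or None when the value is missing.

    Hypothetical helper for illustration only; the rules are assumptions.
    """
    if raw is None or not raw.strip():
        return None  # treat empty or blank input as a missing value

    # Decompose accented characters and drop the combining marks,
    # so that "José" and "Jose" compare equal.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))

    # Reorder the "Last, First" variant into "First Last".
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"

    # Replace punctuation with spaces, lowercase, and collapse whitespace.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(cleaned.lower().split())

variants = ["José García", "GARCIA, Jose", "  jose   garcia. ", "   "]
normalized = [normalize_name(v) for v in variants]
```

The three spellings collapse to one canonical value and the blank entry is flagged as missing, which is the kind of standardization a model needs before it can treat them as the same person.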
Second, data variety challenges relational databases in design, storage and maintenance. NoSQL databases have become a main trend for big data storage because of the flexibility to add or remove data elements easily. However, the convenience at the storage layer brings new challenges at the query layer. With a structured database, an analyst can easily perform all kinds of queries or reports by slicing and dicing data dynamically and quickly; with a NoSQL database, this is typically much harder.
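A rough sketch of that trade-off follows. It does not use any real NoSQL product: plain Python dictionaries stand in for schema-less documents, and the field names (`city`, `location`, `spend`) and helper functions are assumptions made up for the illustration. The same question, total spend per city, is asked of both layouts.

```python
import sqlite3
from collections import defaultdict

# Schema-less "documents": each record is free to store fields differently.
documents = [
    {"user": "a", "city": "Boston", "spend": 10.0},
    {"user": "b", "location": {"city": "Boston"}, "spend": 5.0},
    {"user": "c", "spend": 7.5},  # no city recorded at all
]

def city_of(doc):
    """Find the city wherever this particular document happens to keep it."""
    if "city" in doc:
        return doc["city"]
    return doc.get("location", {}).get("city", "unknown")

def spend_by_city(docs):
    """Aggregate by hand, because the query must know every document shape."""
    totals = defaultdict(float)
    for doc in docs:
        totals[city_of(doc)] += doc.get("spend", 0.0)
    return dict(totals)

def spend_by_city_sql(rows):
    """With a fixed schema the same question is a single GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (user TEXT, city TEXT, spend REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = dict(conn.execute(
        "SELECT city, SUM(spend) FROM purchases GROUP BY city"
    ))
    conn.close()
    return result

rows = [("a", "Boston", 10.0), ("b", "Boston", 5.0), ("c", "unknown", 7.5)]
```

Adding the nested `location` field cost nothing at write time, but every later query has to carry the knowledge of both shapes in `city_of`.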
Third, data variety breaks the links between, and the wholeness of, entities, records or content. Suppose a user has a Facebook account that may be totally different from his or her account on LinkedIn, G+ or YouTube. The same person may deliver similar messages in different ways: text, audio or video. From a raw collection, it’s hard to know whether the messages come from a single person, and it’s also hard to tell whether different friends and hobbies belong to the same person. From Facebook, you get one perspective of a person; from LinkedIn, you get another view; and on YouTube, you hear something different, yet they all come from the same person. The same is true of products and services: they can be discussed by the same group of users in different places, at different times and from different perspectives. We need to find the hidden links for the right reasons, but we cannot follow them directly as we can with linked web pages. That is the challenge!
The Blessing and the Value Proposition of Data Variety
If common patterns and trends, which represent overall popularity or publicity, can be discovered from data volume, then deeper relationships and 360-degree views of entities are most likely to be found through data variety.
Bear in mind that we are in a world of paradox; paradox is the foundation of all beings. Where there is a curse, there is a blessing; where there is a problem or challenge, there is an opportunity. The bigger the curse, the bigger the blessing, and the bigger the challenge, the bigger the opportunity. Every opportunity is wrapped in the form of a problem. Without a problem, there is no opportunity at all!
If you are a forward-thinking leader, you must already have realized that some of the curses or challenges above have already turned into blessings, while other blessings are still on their way to fullness. Many data preprocessing technologies are on the market to improve data quality. Record linkage technology is used to integrate content, resolve entity identity problems and remove duplicates. As data variety becomes richer, more advanced technologies are needed to provide innovative solutions to all of these challenges. Data variety is calling for innovative solutions now more than ever.
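To give a feel for what record linkage does, here is a minimal sketch using only Python's standard library. It is not how any commercial product works: the matching rule (same email, or similar name plus same city), the 0.85 threshold, the field names and the greedy grouping are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_person(rec_a, rec_b, threshold=0.85):
    """Assumed rule: an identical email wins; otherwise name and city must agree."""
    email_a, email_b = rec_a.get("email"), rec_b.get("email")
    if email_a and email_b and email_a.lower() == email_b.lower():
        return True
    names_close = similarity(rec_a["name"], rec_b["name"]) >= threshold
    return names_close and rec_a.get("city") == rec_b.get("city")

def link_records(records, threshold=0.85):
    """Greedy grouping: a record joins the first cluster with a matching member."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if any(same_person(rec, other, threshold) for other in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])  # no match found, start a new entity
    return clusters

# Made-up profiles from different sources
records = [
    {"source": "facebook", "name": "Jon Smith", "email": "js@example.com", "city": "Denver"},
    {"source": "linkedin", "name": "Jonathan Smith", "email": "JS@example.com", "city": "Denver"},
    {"source": "youtube", "name": "Jon Smith", "city": "Denver"},
    {"source": "linkedin", "name": "Maria Lopez", "email": "ml@example.com", "city": "Austin"},
]
clusters = link_records(records)
```

The first three profiles end up in one cluster and the fourth in its own. A greedy pass like this depends on record order and can chain weak matches together, so production systems generally rely on more careful blocking and scoring.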
Below is a possible list of ways we can rely on data variety to create new opportunities:
- Create an entity portfolio: combine different aspects of an entity across space and time to build a horizontal or vertical view of it. Think about any possible entity, not only people or organizations.
- Build relationships between entities, or between dimensions of the same entity, such as the relationships among an entity and its contextual information (where, what, who, when), interests and friends, brands and products, etc.
- Deliver multiple messages from the different perspectives of crowd entities and dimensions.
- Reinforce the same messages repeatedly through different channels, resources or time periods.
- Reveal root causes of a specific problem and explore users’ deeper intent.
- Support real-time applications like advertising: based on where a user appears and clicks a link, in combination with related specific information, contextual ads can be delivered in real time according to the user’s intent and interests as reasoned from the varied data elements.
With that said, the curse and the blessing go hand in hand, and a problem is usually the shadow cast by the light of an opportunity: follow the shadow and the opportunity will be found. Because of my personal limitations, I believe the above list is incomplete. Please add yours to the list or share your problems and challenges in handling data variety. Explore ways to enrich data variety and let it enrich your business by discovering the golden opportunities behind it. I believe that in the future you may know much more about a user than he knows himself, simply because you take advantage of data variety. Data variety is the next winning star of big data, as it holds the promise of a blooming business.
Ling has been with Comcast for about 2 years as a senior manager of advanced analytics and data scientist in the Enterprise Business Intelligence group. Prior to that, Ling was a lead data scientist in the advertising industry, where she developed an analytics road map and performed intensive analytics work to drive product improvement and innovation; she has also been a senior data scientist in a start-up ...