The Journey from Big Data to Big Promise

Big data is usually described in terms of three Vs: Volume, Variety and Velocity. The three Vs sum up the characteristics of big data and convey that it is heterogeneous, noisy, dynamic, inter-related and not trustworthy. Companies now strive to convert the three Vs into Big Promise. That promise can be summarized by three new descriptive terms: Veracity, Value and Victory.

1. Three Vs of Big Promise – Veracity, Value and Victory

Just as the three Vs of big data are related to one another (volume builds on both variety and velocity), the three Vs of Big Promise have an internal relationship. The Veracity mined from big data, based on volume and variety, determines the Value of that data. The Value determines the Victory when a business applies it appropriately and in a timely manner. The higher the Veracity mined from raw data, the more valuable the result, the smarter the decisions a business can make, and the more successful the business will become. All of this leads to a big Victory for the business.

While much around big data remains hype, many companies are in the fledgling stages of drawing value from their big data corpus. Despite the flood of discussion and opinion around the topic, it’s still hard to find a clear roadmap to the Big Promise.

2. The Journey from Big Data to Big Promise

Here I share my thoughts on that roadmap from a high-level view of big data, regardless of the type of business. Basically there are three big steps:

Step 1: Big Data Collection – Gathering Organic Material

Regardless of where you are in the journey, it has to start with understanding the nature of big data as defined by the three Vs. Some voices add more dimensions to big data, such as value and veracity. However, I do not think those are characteristics of raw big data. Instead, I define them as two characteristics of the Big Promise.

Step 2: Big Data Analytics – Gleaning Big Insight

The core technologies are the big data platform and big data analytics. The big data platform provides the power of speedy processing, at millions of records per second. It harnesses integrated technologies for transforming organic (raw) content into designed content, such as natural language processing (NLP), data cleansing, transformation (ETL) and filtering methods. The goal is to transform semi-structured or unstructured data into a structured format for easier understanding, analysis and visualization.
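As a rough illustration of turning raw content into designed content, the sketch below cleans one free-text message and emits a structured record. The function name `to_record`, the sample message and the cleaning rules are assumptions made for this example; a real platform would use far richer NLP and ETL tooling.

```python
import re
from collections import Counter

def to_record(raw_text):
    """Turn one raw message into a small structured record (illustrative only)."""
    # Cleansing: lowercase the text and drop links.
    text = re.sub(r"https?://\S+", " ", raw_text.lower())
    # Transformation: split into word tokens, discarding punctuation.
    tokens = re.findall(r"[a-z']+", text)
    return {"tokens": tokens, "length": len(tokens), "counts": Counter(tokens)}

# Hypothetical raw message.
record = to_record("Loving the NEW phone!!! Best phone ever http://example.com/x")
print(record["length"], record["counts"].most_common(1))
```

Once every message has the same fields, the descriptive statistics discussed later can be computed field by field.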

The world of analytics uses many different terms: text analytics, social media analytics, customer analytics, social network analytics, business analytics, sentiment analytics. On closer thought, though, analytics can be grouped functionally into three categories: Descriptive Analytics, Relationship Analytics and Prescriptive Analytics. Each is explained below.

1. Descriptive Analytics

Once organic data has been transformed into designed data in the data processing phase, the first kind of analytics is descriptive, or exploratory. This phase uses simple statistics to get a general understanding of the data: data properties such as dimensions and field types, and statistical profiles or summaries such as the number of records, missing values, the maximum, minimum and median of a field, and the distribution of field values. Exploratory analysis gives us initial knowledge about the raw content without digging into deeper internal relationships, and it can suggest the right strategies for deeper analysis. This phase can be done on a randomly sampled dataset with simple tools like an Excel spreadsheet, and visualized with basic chart types like bar charts, pie charts and scatter plots. The characteristics of descriptive analytics are:

Autonomy: the analysis is based on individual fields and their values. Each field is treated on its own, independent of other fields, without considering any connections between different fields and contents.

Shallow and straightforward: the result of the analysis is usually basic statistics, like word frequencies, or the number and percentage of employees earning about 5k within a certain geographic area.

Simple and easy to understand: because the method is basic statistical profiling without any extra effort involved, the result is also simple and easy to understand and visualize.

With descriptive analytics, we reach a general understanding of what happened. It’s like a doctor first establishing the facts of what happened to a patient before digging into why the patient got the disease.
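A minimal sketch of such a statistical profile, using only the Python standard library, might look like this. The field values and the 5,000 threshold are hypothetical, and `profile_field` is a name invented for this example.

```python
import statistics

def profile_field(values):
    """Summarize one field on its own, independent of every other field."""
    present = [v for v in values if v is not None]
    return {
        "records": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
    }

# Hypothetical monthly earnings; None marks a missing value.
earnings = [4200, 5100, None, 6100, 4800, 5300]
summary = profile_field(earnings)

# A simple count against an assumed threshold of 5,000.
count_at_least_5k = sum(1 for v in earnings if v is not None and v >= 5000)
print(summary, count_at_least_5k)
```

Nothing here looks at how one field relates to another, which is exactly what separates this level from the next.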

2. Relationship Analytics

This level of analytics aims to dig out valuable insight embedded in the big data. Compared with descriptive analytics, the analysis is deeper. Succeeding at this level requires a range of mining algorithms and methods such as advanced statistics, sophisticated machine learning, interdisciplinary studies, and meta or scalable algorithms. The process is usually complicated and demanding in both speed and volume.

I call this level relationship analytics because its primary goal is to find connections among data elements. The connection may be time based, like a sequential dependency; geolocation based; functional-category based, like the relationship between production and customer purchasing patterns; or transaction based, like market basket analysis.

The methods used at this level may include the following:

Inferential or association analysis draws insight from data through random processes developed with statistical methods. Inference depends on choosing the right population and sampling it randomly. For example, the children of parents who are shorter than the average adult tend to be taller than their parents. In basket analysis, mining millions of transactions shows that some items have a higher probability of being bought together, like coffee and coffee creamer. Some of these conclusions are easy to understand and match common sense; however, the highest value comes from conclusions that run against common sense or expose wrong assumptions.
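To make the basket example concrete, here is a small sketch that scores item pairs by support, confidence and lift, three measures commonly used in association analysis. The transactions, the `pair_rules` name and the 0.3 support cutoff are all assumptions for illustration; a production system would run a scalable algorithm over millions of baskets.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.3):
    """Score item pairs (a, b) by support, confidence P(b|a) and lift."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # too rare to be interesting
        confidence = count / item_counts[a]
        lift = confidence / (item_counts[b] / n)
        rules.append((a, b, support, confidence, lift))
    # Highest lift first: pairs bought together more often than chance predicts.
    return sorted(rules, key=lambda r: r[4], reverse=True)

# Hypothetical transactions.
transactions = [
    ["coffee", "creamer", "sugar"],
    ["coffee", "creamer"],
    ["coffee", "bread"],
    ["bread", "milk"],
    ["coffee", "creamer", "milk"],
]
rules = pair_rules(transactions)
for a, b, support, confidence, lift in rules:
    print(f"{a} -> {b}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if purchases were independent.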

Model-based analysis uses a model developed from known, observed data to infer or predict what will happen in the future. Two subcategories are commonly known: classification and predictive modeling. Usually, when the target variable is categorical, the method is called classification; when it is a numerical or continuous variable, it is called a predictive method. Both methods need a well-labeled training dataset and a test dataset drawn from the same population as the training dataset. The analysis has two phases: first a model is built with the training dataset, then it is evaluated with the test dataset to measure its performance. Once the model is developed, it is used to predict future events or target variables from the independent variables. For example, a linear regression model can be built on the factors that affected sales over the last three months and then used to predict next month’s sales; a decision tree model can be built to predict whether a specific Twitter message is positive or negative. Sometimes classification and predictive methods overlap, depending on the business application.
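The two phases, build then evaluate, can be sketched with a one-variable least-squares line. This covers only the regression case, not the decision tree. The advertising and sales figures are hypothetical and deliberately noise-free so the arithmetic is easy to check by hand; `fit_line` is a name made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Phase 1: build the model on three months of (hypothetical) training data.
train_spend = [10.0, 12.0, 15.0]     # advertising spend, in thousands
train_sales = [100.0, 110.0, 125.0]  # sales, in thousands
slope, intercept = fit_line(train_spend, train_sales)

# Phase 2: measure the error on a held-out test observation.
test_spend, test_sales = 14.0, 120.0
test_error = abs((slope * test_spend + intercept) - test_sales)

# Use the model: predict next month's sales for a planned spend of 18.
next_month_sales = slope * 18.0 + intercept
print(slope, intercept, test_error, next_month_sales)
```

With real data the test error will not be near zero, and its size is what tells you whether the model is good enough to deploy.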

Segmentation dynamically groups data into different clusters based on a predefined measurement such as a distance metric. The method differs from classification and predictive methods in that it does not need training or test data. For example, an algorithm can be used to dynamically group similar Twitter messages into different clusters.
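One common distance-based segmentation method is k-means, sketched below in plain Python. It is named here as one possible choice, not necessarily the method the text has in mind. The 2-D points are hypothetical stand-ins for numeric features (messages would first have to be converted to vectors), and the naive initialization from the first k points is an assumption that keeps the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by squared Euclidean distance."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical feature vectors forming two loose groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No labels are supplied anywhere: the groups emerge from the distance measure alone.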

3. Prescriptive Analytics

Prescriptive analysis is actually a business decision based on the conclusions or results drawn from relationship analysis. For a given situation, what is the best action to take so that we get the expected result in the future? Suppose a patient goes to see a doctor. First the doctor performs descriptive analysis, the fact-finding phase, to understand what happened to the patient and some related factors like daily activities, workload and nutrition. Next the doctor performs relationship analysis to find out which factors may have made the patient sick. Finally the doctor gives the patient a prescription, such as medicines to take, so that the patient can get well.

Step 3: Reap Big Promise

To fully empower a business with insight drawn from analytics, the veracity of the result first has to be verified before it is deployed into a business application to generate valuable results. The main measures used to evaluate the veracity of analytics results or models include precision, recall and accuracy. We also need to consider the business cost, in dollars, of each error made. Basically, there are three phases in evaluating performance:

1) Once the model or algorithm is developed, its performance is evaluated on a validation dataset drawn from the same population as the training dataset. If the result is not good enough, the model needs to be redeveloped by adding more data, tuning parameters, or exploring other methods.

2) The model is evaluated against a test dataset drawn from a different dataset than the training data. This dataset is more representative of real-world data at the time the model is developed, and the associated error cost should also be measured against business objectives.

3) The model is evaluated on an ongoing basis. Because the world changes so fast, new data comes in that may be quite different from the dataset used to develop the model. Phase 3 should be performed on a regular schedule so that predictions do not drift too far from what is expected and cause business failures. Once it is found that the model no longer performs well enough, the process goes back to 2).
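The precision, recall, accuracy and error-cost measures mentioned above can be computed from a list of actual and predicted labels, as in this sketch. The labels and the dollar costs per false positive and false negative are hypothetical, and `evaluate` is a name chosen for this example.

```python
def evaluate(actual, predicted, cost_fp=0.0, cost_fn=0.0):
    """Compute precision, recall, accuracy and total error cost for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
        "error_cost": fp * cost_fp + fn * cost_fn,
    }

# Hypothetical test-set labels (1 = positive, 0 = negative).
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

# Assumed business costs: $50 per false positive, $200 per false negative.
metrics = evaluate(actual, predicted, cost_fp=50.0, cost_fn=200.0)
print(metrics)
```

Running the same function on each new batch of labeled data is one simple way to carry out the ongoing evaluation in phase 3 and notice when performance starts to slip.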

The value drawn from the data veracity of 2) also depends on how well the business takes full advantage of it, that is, how many opportunities it finds to use it in providing business intelligence to customers. Exploring the right business opportunities and defining the right objectives are the key factors in generating business value. If a company can generate higher revenue, victory will shine brightly tomorrow.
