What Is Your Big Data Analytics Stack?

We often get asked this question – Where do I begin? How are problems being solved using big-data analytics?

To answer this question we need to take a step back and think in the context of the problem and a complete solution to the problem.

The objective of big data, or any data for that matter, is to solve a business problem. The business problem is also called a use-case. We always keep that in mind. The easiest way to explain the data stack is to start at the bottom, even though building a use-case starts from the top.

Data Layer: The bottom layer of the stack, of course, is data. This is the raw ingredient that feeds the stack. The players here are the database and storage vendors. Hadoop, with its innovative approach, is making a lot of waves in this layer.

Data Preparation Layer: The next layer is data preparation. As we all know, data is typically messy and rarely in the right form. Data preparation is the process of extracting data from the source(s), merging data sets and preparing the data required for the analysis step. There are emerging players in this area.
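To make the extract, merge and prepare steps concrete, here is a minimal sketch in Python using only the standard library. The two CSV sources, their column names and the rule for dropping rows with a missing amount are all hypothetical, chosen for illustration; a real preparation tool would do far more.

```python
import csv
import io

# Hypothetical raw sources; in practice these would be files or database extracts.
ORDERS_CSV = """order_id,customer_id,amount
1001,c1, 250.00
1002,c2,
1003,c1,75.50
"""

CUSTOMERS_CSV = """customer_id,region
c1,West
c2,East
"""

def extract(text):
    """Extract: read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def prepare(orders, customers):
    """Merge orders with customers and drop rows that have no amount."""
    region_by_customer = {c["customer_id"]: c["region"] for c in customers}
    prepared = []
    for row in orders:
        amount = row["amount"].strip()
        if not amount:
            continue  # messy row: no amount recorded, so skip it
        prepared.append({
            "order_id": row["order_id"],
            "region": region_by_customer.get(row["customer_id"], "unknown"),
            "amount": float(amount),
        })
    return prepared

rows = prepare(extract(ORDERS_CSV), extract(CUSTOMERS_CSV))
```

Here `rows` holds the two clean orders, each tagged with its customer's region and ready for the analysis step.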

Analysis Layer: The next layer is the analysis layer. Statistics is the most commonly known analysis tool. For statistics, the commonly available solutions are commercial statistics packages and open source R. This is also the layer for the emerging machine learning solutions. Automated analysis with machine learning is the future.
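As a small illustration of a statistical analysis step, the sketch below flags values that sit far from the mean using a z-score. The function name, the sample amounts and the threshold of 2.0 are assumptions made for this example; it is not a recommendation for any particular use-case.

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical transaction amounts with one unusual value.
amounts = [100, 102, 98, 101, 99, 500]
suspicious = flag_outliers(amounts)
```

A machine learning solution would replace the fixed rule with a model learned from the data, but the place of the step in the stack is the same: prepared data goes in, a result goes out to the presentation layer.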

Presentation Layer: The output from the analysis engine feeds the presentation layer. The form of the presentation layer depends on the use-case. This layer is also called the action layer, consumption layer or last mile.

If the result of the use-case is to be presented to a human, the presentation layer may be a BI or visualization tool. Example use-cases are fraud detection, order-to-cash monitoring, etc. In each case the final result is sent to human decision makers to act on.
For some use-cases, the results need to feed a downstream system, which may be another program. Example use-cases are recommendation systems, real-time pricing systems, etc. In this case the analysis results are fed into the downstream system that acts on them.
If the use-case is an alerting system, then the analysis results feed an event processing or alerting system. Example use-cases are medical device failure, network failure, etc. In this case the results of the analysis are fed into a system that can send out alerts to humans or machines that will act on the results in real-time or near real-time.
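The three kinds of consumers described above can be sketched as a simple dispatch in Python. The handler names, the shape of the `result` dictionary and the alert threshold of 0.9 are hypothetical; the point is only that the same analysis result is routed differently depending on who or what consumes it.

```python
def send_to_dashboard(result):
    """Human consumer: format the result for a BI or visualization tool."""
    return f"DASHBOARD: {result['use_case']} score={result['score']}"

def send_to_downstream(result):
    """Program consumer: hand over a machine-readable payload."""
    return {"use_case": result["use_case"], "score": result["score"]}

def send_alert(result, threshold=0.9):
    """Alerting consumer: raise an alert only when the score crosses the threshold."""
    if result["score"] >= threshold:
        return f"ALERT: {result['use_case']}"
    return None

# The use-case decides which consumer receives the analysis result.
ROUTES = {
    "human": send_to_dashboard,
    "system": send_to_downstream,
    "alert": send_alert,
}

def present(result, consumer):
    """Route an analysis result to the handler for the given consumer type."""
    try:
        handler = ROUTES[consumer]
    except KeyError:
        raise ValueError(f"unknown consumer type: {consumer}") from None
    return handler(result)
```

For example, `present({"use_case": "network failure", "score": 0.95}, "alert")` would produce an alert message, while the same call with `"system"` would return a payload for another program to act on.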

Use-case Layer: This is the value layer, and the ultimate purpose of the entire data stack. The use-case drives the selection of tools in each layer of the data stack. Example use-cases are fraud detection, dropped call alerting, network failure, supplier failure alerting, machine failure, and so on. Like recipes in a cookbook, the number of use-cases is practically infinite. As the types and amount of data grow, the number of use-cases will grow.

How do you think about your data stack?

What are your thoughts?