Consider This: The Big Data Workout

May 4, 2015

To begin at the beginning

Miss Piggy said, “Never eat more than you can lift”. That statement is no less true today, especially when it comes to Big Data.

The biggest disadvantage of Big Data is that there is so much of it, and one of the biggest problems with Big Data is that few people can agree on what it is. Overcoming the disadvantage of size is possible; overcoming the problem of understanding may take some time.

As I mentioned in my piece Taming Big Data, “the best application of Big Data is in systems and methods that will significantly reduce the data footprint.” In that piece I also outlined three conclusions:

  • Taming Big Data is a business, management and technical imperative.
  • The best approach to taming the data avalanche is to ensure there is no data avalanche – this approach is moving the problem upstream.
  • The use of smart ‘data governors’ will provide a practical way to control the flow of high volumes of data.

“Data Governors”, I hear you ask, “What are Data Governors?”

Let me address that question.

Simply stated, the Data Governor approach to Big Data obtuseness is this:

  • The Big Data Governor’s role is to help in the purposeful and meaningful reduction of the ever-expanding data footprint, especially as it relates to data volumes and velocity (see Gartner’s 3Vs).
  • The reduction techniques are exclusion, inclusion and exception.
  • Its implementation is made through a development environment that can target hardware, firmware, middleware and software forms of hosting and continuously monitored execution.

In short, it is a comprehensive approach to reducing the Big Data footprint whilst simultaneously maintaining data fidelity.
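The three reduction techniques can be illustrated with a minimal Python sketch. The article does not define their exact semantics, so this assumes that exclusion drops records matching a rule, inclusion keeps only records matching a rule, and exception keeps only records that deviate from an expected norm. The record fields, rules and sample readings are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Rule = Callable[[Record], bool]


def exclude(records: Iterable[Record], rule: Rule) -> Iterator[Record]:
    """Exclusion: drop every record that matches the rule."""
    return (r for r in records if not rule(r))


def include(records: Iterable[Record], rule: Rule) -> Iterator[Record]:
    """Inclusion: keep only the records that match the rule."""
    return (r for r in records if rule(r))


def exceptions(records: Iterable[Record], is_normal: Rule) -> Iterator[Record]:
    """Exception: keep only the records that deviate from the expected norm."""
    return (r for r in records if not is_normal(r))


# Hypothetical sample data, for illustration only.
readings = [
    {"sensor": "temp", "value": 21.0, "debug": False},
    {"sensor": "temp", "value": 95.5, "debug": False},
    {"sensor": "temp", "value": 20.5, "debug": True},
    {"sensor": "humidity", "value": 40.0, "debug": False},
]

stage1 = exclude(readings, lambda r: r["debug"])            # drop debug records
stage2 = include(stage1, lambda r: r["sensor"] == "temp")   # keep temperature only
passed_on = list(exceptions(stage2, lambda r: 15.0 <= r["value"] <= 30.0))
print(passed_on)
```

Chaining the stages as lazy generators means nothing is materialised until the final list is built, which suits a governor that sits in the data path rather than behind a data store.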

Here are some examples:

Integrated Circuit Wafer Testing

What’s this all about? Here’s an answer the good folk at Wikipedia cooked up earlier:

“Wafer testing is a step performed during semiconductor device fabrication. During this step, performed before a wafer is sent to die preparation, all individual integrated circuits that are present on the wafer are tested for functional defects by applying special test patterns to them. The wafer testing is performed by a piece of test equipment called a wafer prober. The process of wafer testing can be referred to in several ways: Wafer Final Test (WFT), Electronic Die Sort (EDS) and Circuit Probe (CP) are probably the most common.” (Link / Wikipedia)

Fig.1 – IC Fab Testing and the CE Data Governor

This exhibit shows where the Data Governor is placed in the Integrated Circuit fabrication and testing/probing chain.

In large plants, the IC probing process generates very large volumes of data at high velocity.

Based on exception rules, the Data Governor reduces the flow of data to the centralised data store.

It also increases data velocity and shortens time to analysis.

Greater speed and less volume mean that production showstoppers are spotted earlier, thereby potentially leading to significant savings in production and recuperation costs.

Let’s look at some of the technical details:

  • Taking our example of the IC Fab test/probe chain, a Data Governor should be able to handle a hierarchy or matrix of designation and exception.
  • For example, a top-level Data Governor actor could be the Production Run actor.
  • The Production Run actor could designate and assign exception rules to a Batch Analysis actor.
  • In turn, the Batch Analysis actor could designate and assign exception rules to a Wafer Instance Analysis actor.
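The chain of designation described above might be sketched as follows. This is an illustrative assumption rather than the CE Data Governor's actual design: the `GovernorActor` class, the `designate` method, the measurement fields and the thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# A rule returns True when a measurement counts as an exception.
Rule = Callable[[dict], bool]


@dataclass
class GovernorActor:
    name: str
    rules: list[Rule] = field(default_factory=list)
    children: list["GovernorActor"] = field(default_factory=list)

    def designate(self, child_name: str, rules: list[Rule]) -> "GovernorActor":
        """Create a subordinate actor and assign it its exception rules."""
        child = GovernorActor(child_name, rules=list(rules))
        self.children.append(child)
        return child

    def is_exception(self, measurement: dict) -> bool:
        """A measurement is passed on if any assigned rule flags it."""
        return any(rule(measurement) for rule in self.rules)


# Production Run -> Batch Analysis -> Wafer Instance Analysis
run = GovernorActor("production-run")
batch = run.designate("batch-analysis", [lambda m: m["yield_pct"] < 90.0])
wafer = batch.designate("wafer-instance-analysis", [lambda m: m["leakage_ua"] > 5.0])

# Hypothetical per-die probe results.
die_results = [
    {"die": 1, "leakage_ua": 1.2},
    {"die": 2, "leakage_ua": 7.8},
    {"die": 3, "leakage_ua": 0.9},
]
flagged = [m for m in die_results if wafer.is_exception(m)]
print(flagged)
```

Only the flagged die would travel to the centralised store; the two in-tolerance results stay at the prober.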

The Internet of Things – IoT

Intrinsically linked to Big Data and Big Data Analytics, the Internet of Things (IoT) is described as follows:

“The Internet of Things (IoT) is the network of physical objects or “things” embedded with electronics, software, sensors and connectivity to enable it to achieve greater value and service by exchanging data with the manufacturer, operator and/or other connected devices. Each thing is uniquely identifiable through its embedded computing system but is able to interoperate within the existing Internet infrastructure.” (Link / Wikipedia)

Fig.2 – The Internet of Things and the CE Data Governor

This exhibit shows where the Data Governor is placed in the Internet of Things data flow.

The Data Governor is embedded into an IoT device, and functions as a data exception engine.

Based on exception rules and triggers, the Data Governor reduces the flow of data to the centralised / regionalised data store.

It also increases data velocity and shortens time to analysis.

Greater speed and less volume mean that important signals are spotted earlier, thereby quite possibly leading to more effective analysis and quicker time to action.

Net Activity

Much play is made of the possibility that we will all be extracting golden nuggets from web server logs sometime in the near future. I don’t want to get into the business value argument here, but would like to describe a way of getting Big Data to shed the excess web-server-log bloat.

Fig.3 – Web Server Activity Logging and the CE Data Governor

This exhibit shows where the Data Governor is placed in the capture and logging of interactive internet activity.

The Data Governor acts as a virtual device written to by standard and customised log writers, and functions as a data exception engine.

Based on exception rules and triggers, the Data Governor reduces the flow of data from internet activity logging.

It also increases data velocity and shortens time to analysis.

Greater speed and significantly reduced data volumes may lead to more effective and focused analysis and quicker time to action.
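As a rough sketch of the "virtual device" idea, the Python class below exposes a file-like `write()` method that log writers can target, and forwards only exception lines to the real sink. The class name `GovernedLog`, the rule (HTTP status of 400 or above) and the assumption that lines follow the Common Log Format are illustrative choices, not details taken from the CE Data Governor.

```python
import io
import re

# Finds the status code in a Common Log Format line,
# e.g. ... "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


class GovernedLog:
    """File-like wrapper: writers call write(); only exception lines reach the sink."""

    def __init__(self, sink, min_status: int = 400):
        self.sink = sink
        self.min_status = min_status
        self.seen = 0
        self.forwarded = 0

    def write(self, line: str) -> int:
        self.seen += 1
        match = STATUS_RE.search(line)
        # Assumption: lines without a parsable status are kept, to be safe.
        if match is None or int(match.group(1)) >= self.min_status:
            self.forwarded += 1
            return self.sink.write(line)
        return 0  # line discarded, nothing written


sink = io.StringIO()  # stands in for the real log file or collector
log = GovernedLog(sink)
for line in [
    '203.0.113.5 - - [04/May/2015:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120\n',
    '203.0.113.9 - - [04/May/2015:10:00:01 +0000] "GET /missing HTTP/1.1" 404 312\n',
    '203.0.113.7 - - [04/May/2015:10:00:02 +0000] "POST /order HTTP/1.1" 500 128\n',
]:
    log.write(line)
print(log.seen, log.forwarded)
```

Because the wrapper only needs a `write()` method, existing log writers would not have to change; they would simply be pointed at the governor instead of the file.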

Signal Data

Signal data can be a continuous stream of data originating from devices such as temperature and proximity sensors. By its nature, it can generate high volumes of data at high velocity: lots of data, very quickly.

Fig.4 – Signal Data and the CE Data Governor

This exhibit shows where the Data Governor is placed in the stream of continuous signal data.

The Data Governor acts as an in-line data-exception engine.

Based on exception rules and triggers, the Data Governor reduces the flow of signal data.

It also increases data velocity and shortens time to analysis.

Greater speed and significantly reduced data volumes may lead to more effective and focused analysis and quicker time to action.
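One well-known way to implement an in-line exception rule for continuous signals is a deadband (report-by-exception) filter, which passes a sample on only when it has moved far enough from the last value reported. The article does not say that the Data Governor uses this technique; the sketch below, with its hypothetical threshold and temperature samples, is offered purely as an illustration.

```python
from typing import Iterable, Iterator


def deadband(samples: Iterable[float], threshold: float) -> Iterator[float]:
    """Yield a sample only if it differs from the last yielded one by more than threshold."""
    last = None
    for value in samples:
        if last is None or abs(value - last) > threshold:
            last = value
            yield value


# Hypothetical temperature stream in degrees Celsius.
temperatures = [20.0, 20.1, 20.05, 20.2, 22.5, 22.6, 22.4, 19.0]
reported = list(deadband(temperatures, threshold=0.5))
print(reported)  # [20.0, 22.5, 19.0]
```

Eight samples shrink to three, yet the two significant changes are still visible downstream.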

Machine Data

“Machine-generated data is information which was automatically created from a computer process, application, or other machine without the intervention of a human.” (Link / Wikipedia)

Fig.5 – Machine Data and the CE Data Governor

This exhibit shows where the Data Governor is placed in the stream of continuous machine-generated data.

The Data Governor acts as an in-line data analysis and exception engine.

Exception data is stored locally and periodically transferred to an analysis centre.

Analysis of the totality of data of the same class and origin can be used to drive ANN (artificial neural network) and statistical analysis, which can in turn support (for example) the automatic and semi-automatic generation of preventive maintenance rules.

Greater speed and significantly reduced data volumes may lead to more effective and focused analysis and quicker time to proactivity.
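The store-locally-then-transfer pattern for exception data might look something like the sketch below. It is a simplification under stated assumptions: the `LocalExceptionStore` class and its `transfer` callback are hypothetical, and a batch size stands in for the periodic timer so that the example runs deterministically.

```python
from typing import Callable


class LocalExceptionStore:
    """Buffers exception records locally and ships them onward in batches."""

    def __init__(self, transfer: Callable[[list], None], batch_size: int = 3):
        self.transfer = transfer      # hypothetical uplink to the analysis centre
        self.batch_size = batch_size  # stands in for a periodic timer
        self.buffer: list = []

    def record(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send a copy of the buffered events, then empty the local buffer."""
        if self.buffer:
            self.transfer(list(self.buffer))
            self.buffer.clear()


received = []  # stands in for the analysis centre
store = LocalExceptionStore(transfer=received.append, batch_size=3)

# Hypothetical machine readings; the over-speed rule is illustrative only.
for rpm in [1500, 1510, 4200, 1495, 3900, 4100, 1500, 4500]:
    if rpm > 3000:
        store.record({"rpm": rpm})
store.flush()  # ship whatever remains, e.g. at shutdown
print(received)
```

Here eight readings produce two transfers carrying four exception records in total; the normal readings never leave the machine.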

Other Applications of the Data Governor

The options are not endless and the prizes are not rich beyond the dreams of avarice, but there are some exciting possibilities out there, including applications in the trading, plant monitoring, sport and climate change ‘spaces’.

Fig.6 – Other Applications in the Big Data ‘space’ and the CE Data Governor

Summary

To wrap up, this is what the CE Data Governor approach looks like at a high level of abstraction:

  • Data is generated, captured, created or invented.
  • It is stored to a real device or virtual device.
  • The Data Governor (in all its configurations) acts as a data discrimination and data exception manager and ensures that significant data is passed on.
  • Significant data is used for ‘business purposes’ and to potentially refine the rules of the CE Data Governor.

To summarise the drivers:

  • We should only generate data that is required, that has value, and that has a business purpose – whether management oriented, business oriented or technical in nature.
  • We should filter Big Data, early and often.
  • We should store, transmit and analyse Big Data only when there is a real business imperative that prompts us to do so.

Moreover, we have a set of clear and justifiable objectives:

  • Making data smaller reduces the data footprint – lower cost, less operational complexity and greater focus.
  • The earlier you filter data the smaller the data footprint is – lower costs, less operational complexity and greater focus.
  • A smaller data footprint accelerates the processing of the data that does have potential business value – lower cost, higher value, less complexity and best focus.

Many thanks for reading.

Can we help? Leave a comment below, contact me via LinkedIn or write to martyn.jones@cambriano.es
