Aligning Big Data
In order to bring some semblance of simplicity, coherence and integrity to the Big Data debate, I am sharing an evolving model for pervasive information architecture and management.
This is an overview of the realignment and placement of Big Data into a more generalised architectural framework, one that integrates data warehousing (DW 2.0), business intelligence and statistical analysis.
The model is currently referred to as the DW 3.0 Information Supply Framework, or DW 3.0 for short.
In a previous piece, Data Made Simple - Even 'Big Data' (available here on the Good Strat Blog), I looked at three broad-brush classes of data: Enterprise Operational Data; Enterprise Process Data; and, Enterprise Information Data. The following is a diagram taken from that piece:
Fig. 1 - Data Made Simple
In simple terms the classes of data can be defined in the following terms:
Enterprise Operational Data – This is data used in applications that support the day-to-day running of an organisation’s operations.
Enterprise Process Data – This is measurement and management data collected to show how the operational systems are performing.
Enterprise Information Data – This is primarily data collected from internal and external data sources, the most significant source typically being Enterprise Operational Data.
These three classes form the underlying basis of DW 3.0.
The overall view
The following diagram illustrates the overall framework:
Fig. 2 – DW 3.0 Information Supply Framework
There are three main elements within this diagram: Data Sources; Core Data Warehousing (the Inmon architecture and process model); and, Core Statistics.
Data Sources – This element covers all the current sources, varieties and volumes of data available that may be used to support the processes of 'challenge identification', 'option definition' and decision making, including statistical analysis and scenario generation.
Core Data Warehousing – This is a suggested evolution path of the DW 2.0 model. It faithfully extends the Inmon paradigm to not only include unstructured and complex data but also the information and outcomes derived from statistical analysis performed outside of the Core Data Warehousing landscape.
Core Statistics – This element covers the core body of statistical competence, especially but not only with regard to evolving data volumes, data velocity, data quality and data variety.
The focus of this piece is on the Core Statistics element. Mention will also be made of how the three elements provide useful synergies.
The following diagram focuses on the Core Statistics element of the model:
Fig. 3 – DW 3.0 Core Statistics
What this diagram seeks to illustrate is the flow of data and information through the process of data acquisition, statistical analysis and outcome integration.
What this model also introduces is the concept of the Analytics Data Store. This is arguably the most important aspect of this architectural element.
For the sake of simplicity there are three explicitly named data sources in the diagram (of course there can be more, and the Enterprise Data Warehouse or its dependent Data Marts may also act as data sources), but for the purposes of this blog piece I have limited the number to three: Complex data; Event data; and, Infrastructure data.
Complex Data – This is unstructured data, or data with a highly complex structure, contained in documents and other complex data artefacts, such as multimedia files.
Event Data – This is an aspect of Enterprise Process Data, typically at a fine-grained level of abstraction. Here are the business process logs, the internet web activity logs and other similar sources of event data. The volumes generated by these sources tend to be higher than those of other data sources, and are the volumes currently associated with the term Big Data, covering as it does the masses of information generated by tracking even the most minor piece of 'behavioural data' from, for example, someone casually surfing a web site.
Infrastructure Data – This aspect includes data that could well be described as signal data: continuous, high-velocity streams of potentially highly volatile data that might be processed through complex event correlation and analysis components.
The Revolution Starts Here
Here I will backtrack slightly to highlight some guiding principles behind this architectural element.
Without a business imperative there is no business reason to do it: What does this mean? It means that for every significant action or initiative, even a highly speculative one, there must be a tangible and credible business imperative to support it. The difference is as clear as that between the Sage of Omaha and Santa Claus.
All architectural decisions are based on a full and deep understanding of what needs to be achieved and of all the available options: For example, rejecting the use of a high-performance database management product must be done for sound reasons, even if that sound reason is cost. It should not be based on technical opinions such as "I don't like the vendor, much". If a flavour of Hadoop makes absolute sense then use it; if Exasol or Oracle or Teradata makes sense, then use them. You have to be technology agnostic, but not a dogmatic technology fundamentalist.
That statistics and non-traditional data sources are fully integrated into the future Data Warehousing landscape architectures: Building even more corporate silos, whether through action or omission, will lead to greater inefficiencies, greater misunderstanding and greater risk.
The architecture must be coherent, usable and cost-effective: If not, what's the point, right?
That no technology, technique or method is discounted: We need to be able to cost-effectively incorporate any relevant, existing and emerging technology into the architectural landscape.
Reduce early and reduce often: Massive volumes of data, especially at high speed, are problematic. Reducing those volumes, even if we cannot reduce the velocity, is absolutely essential. I will elaborate on this point and the following one separately.
That only the data that is required is sourced. That only the data that is required is forwarded: Again, this harks back to the need for clear business imperatives, tied to the good sense of only shipping data that needs to be shipped.
Reduce Early, Reduce Often
Here I expand on the theme of early data filtering, reduction and aggregation. We may be generating increasingly massive amounts of data, but that doesn't mean we need to hoard all of it in order to get some value from it.
In simplistic data terms this is about putting the initial ET in ETL (Extract and Transform) as close to the data generators as possible. It's the concept of the database adapter, but in reverse.
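To make the "early ET" idea concrete, here is a minimal sketch of extract-and-transform logic pushed to the data generator itself, so that only reduced records are shipped onwards. The field names (session_id, page, status) and the heartbeat-discarding rule are hypothetical assumptions for illustration, not part of the framework.

```python
# Early ET at the point of generation: keep only the fields the downstream
# analysis needs, and drop zero-value events before anything is shipped.
# Field names and the 'heartbeat' noise rule are illustrative assumptions.

RELEVANT_FIELDS = ("timestamp", "session_id", "page", "status")

def extract_transform(raw_event):
    """Reduce one raw event; return None for events not worth shipping."""
    if raw_event.get("status") == "heartbeat":  # noise: discard immediately
        return None
    return {k: raw_event[k] for k in RELEVANT_FIELDS if k in raw_event}

def emit(raw_events):
    """Yield only the reduced events, ready for onward transmission."""
    for event in raw_events:
        reduced = extract_transform(event)
        if reduced is not None:
            yield reduced
```

The point is that the verbose payload (user agents, referrer chains and so on) never leaves the generator unless the business imperative says it must.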
Let's look at a scenario.
A corporation wants to carry out some speculative analysis on the many terabytes of internet web-site activity log data being generated and collected every minute of every day.
They are shipping massive log files to a distributed platform on which they can run data mapping and reduction.
Then they can analyse the resulting data.
The problem they have, as with many web sites developed by hackers, designers and stylists rather than engineers, architects and database experts, is that they are lumbered with humongous and unwieldy artefacts, such as massive log files full of verbose, obtuse and zero-value data.
What do we need to ensure that this challenge is removed?
We need to rethink internet logging and then we need to redesign it.
- We need to be able to tokenise log data in order to reduce the massive data footprint created by badly designed and verbose data.
- We need the dual option of being able to continuously send data to an Event Appliance that can be used to reduce data volumes on an event-by-event and session-by-session basis.
- If we must use log files, then many small log files are preferable to fewer massive log files, and more log cycles are preferable to few log cycles. We must also maximise the benefits of parallel logging. Time bound/volume bound session logs are also worth considering and in more depth.
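On the tokenisation point above, here is a hedged sketch of what reducing a verbose log footprint might look like: long, endlessly repeated strings (URLs, user agents) are replaced by short integer tokens, with a dictionary shipped once per log cycle. The field choices are assumptions for illustration.

```python
# Log tokenisation sketch: replace repeated verbose strings with integer
# tokens plus a shared dictionary, shrinking the shipped log footprint.
# The tokenised fields ('url', 'user_agent') are illustrative assumptions.

class Tokeniser:
    def __init__(self):
        self.dictionary = {}  # verbose value -> compact integer token

    def tokenise(self, value):
        if value not in self.dictionary:
            self.dictionary[value] = len(self.dictionary)
        return self.dictionary[value]

def tokenise_record(record, tok, fields=("url", "user_agent")):
    """Return the record with the chosen verbose fields tokenised."""
    return {k: (tok.tokenise(v) if k in fields else v)
            for k, v in record.items()}
```

Each log cycle would then ship the small tokenised records plus one copy of the dictionary, rather than repeating the same verbose strings millions of times.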
So now we are getting log data to the point of use either via log files, via log files produced by an Event Appliance (as part of a toolkit of Analytic Data Harvesting Adapters), or sent by that appliance to a reception point via messaging.
Once that data has been transmitted (via conventional file transfer/sharing or messaging) we can move to the next step: ET(A)L - Extract, Transform, Analyse and Load.
For log files we would typically employ the full ET(A)L, but for messages we do not need the E, the extract, as this is a direct connection.
Again, ET(A)L is another form of reduction mechanism, which is why the analytics aspect is included: to ensure that the data that gets through is the data that is needed, and that the junk and noise with no recognisable value gets cleaned out early and often.
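The ET(A)L sequence can be sketched as a simple chain of generator stages, with the Analyse step sitting between Transform and Load as a reduction gate. The tab-separated input layout and the "keep only purchases and signups" rule are stand-in assumptions, not part of the framework.

```python
# ET(A)L sketch: Extract, Transform, Analyse, Load as chained stages.
# The analyse stage is the reduction gate that cleans out noise before load.
# Input layout (tab-separated ts/session/action) is an illustrative assumption.

def extract(lines):
    for line in lines:
        yield line.rstrip("\n").split("\t")

def transform(rows):
    for ts, session, action in rows:
        yield {"ts": int(ts), "session": session, "action": action}

def analyse(events, keep=("purchase", "signup")):
    # Only events with recognisable business value pass through to load.
    return (e for e in events if e["action"] in keep)

def load(events, store):
    store.extend(events)
    return store
```

For the messaging path, the extract stage would simply be dropped and the transform stage fed directly from the message stream.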
The Analytics Data Store
The ADS (which can be a distributed data store on a Cloud somewhere) supports the data requirements of statistical analysis. Here the data is organised, structured, integrated and enriched to meet the ongoing and occasionally volatile needs of the statisticians and data scientists focused on data mining. Data in the ADS can be accumulative or completely refreshed, and it can have a short life span or a significantly long lifetime.
The ADS is the logistics centre for analytics data. Not only can it be used to provide data into the statistical analysis process, but it can also be used to provide persistent long term storage for analysis outcomes and scenarios, and for future analysis, hence the ability to 'write back'.
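As a minimal sketch of that 'write back' capability, the ADS can be pictured as a store with a source-data side and an outcomes side, where analysis results are persisted for future runs. The sqlite backing and the table and column names here are assumptions chosen purely for illustration.

```python
# Analytics Data Store sketch with 'write back': analysis outcomes are
# persisted alongside the source data so later analysis can reuse them.
# The sqlite backing and all table/column names are illustrative assumptions.

import sqlite3

class AnalyticsDataStore:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS source_data (k TEXT, v REAL)")
        self.db.execute("CREATE TABLE IF NOT EXISTS outcomes (run TEXT, result TEXT)")

    def load_source(self, rows):
        """Feed organised, enriched data into the analysis side."""
        self.db.executemany("INSERT INTO source_data VALUES (?, ?)", rows)

    def write_back(self, run_id, result):
        """Persist an analysis outcome or scenario for future analysis."""
        self.db.execute("INSERT INTO outcomes VALUES (?, ?)", (run_id, result))

    def outcomes(self):
        return self.db.execute("SELECT run, result FROM outcomes").fetchall()
```

The same store thereby acts as both the supplier of data into the statistical analysis process and the long-term home of its outputs, which is what makes it the logistics centre for analytics data.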
The data and information in the ADS may also be augmented with data derived from the data warehouse; the ADS may also benefit from having its own dedicated Data Mart specifically designed for this purpose.
Results of statistical analysis on the ADS data may also provide feedback used to further tune the data reduction, filtering and enrichment rules, whether in smart data analytics, in complex event and discrimination adapters, or in ET(A)L job streams.
That's all folks.
This has been necessarily a very brief and high-level view of what I currently label DW 3.0.
The model doesn't seek to define statistics or how statistical analysis is to be applied (that has been done more than adequately elsewhere), but how statistics can be accommodated in an extended DW 2.0 architecture, without the need for reactionary and ill-fitting solutions to problems that can be solved in better and more effective ways through good sense, sound engineering principles and the judicious application of appropriate methods, technologies and techniques.
If you have questions or suggestions or observations regarding this framework, then please feel free to contact me.
Many thanks for reading.
The following is a slide-set that also illustrates the DW 3.0 framework, with a special focus on the Analytics Data Store: