Data Integration Ecosystem for Big Data and Analytics
In my article, “Data Integration Roadmap to Support Big Data and Analytics,” I detailed a five-step process for transitioning traditional ETL infrastructure to support the future demands on data integration services. On any journey it helps to have insight into the end state, and this is especially true for data integration work, which is constantly challenged to hit the ground running.
Two major architectural changes are shaking traditional integration platforms and warrant a journey to a future state. The first is the ability, and the need, of organizations to store and use big data. Much of this data has been available for a long time, but only now are tools and techniques available to process it for business benefit. The second is the need for predictive analytics based on historical data, patterns in past data, or hypothetical data-driven models. While business intelligence deals with what has happened, business analytics deals with what is expected to happen. Statistical methods and tools that predict process outputs have existed in the manufacturing industry for several decades, but only recently have organizations begun experimenting with applying them to organizational data assets, with the potential for a much broader application of predictive analytics.
The diagram below depicts the most common end state for the data integration ecosystem. There are six major components in this system.
Sources – the first component is the set of sources of structured and unstructured data. With the addition of cloud-hosted systems and mobile infrastructure, the size, velocity and complexity of traditional datasets began to multiply significantly. This trend is likely to continue: Computer Sciences Corporation predicted that data production would be 44 times greater in 2020 than in 2009. With this level of growth, data sources and their sheer volume form the main component of the new data integration ecosystem. The data integration architecture should enable multiple strategies to access or store this diverse, volatile and exploding amount of data.
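To put the 44-fold prediction in planning terms, it can be converted into an implied yearly growth rate. The short calculation below is only a sketch: it assumes that "44 times" means a multiple of 44 over the 11 years from 2009 to 2020 and that growth compounds at a constant rate, neither of which the prediction itself states. The function name is hypothetical.

```python
def implied_annual_growth(multiple: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `multiple` over `years`.

    Assumes smooth compound growth, which is a simplification.
    """
    return multiple ** (1.0 / years) - 1.0


# Assumption: 44x growth spread evenly over 2009 -> 2020 (11 years).
rate = implied_annual_growth(44, 2020 - 2009)
print(f"Implied growth per year: {rate:.1%}")
```

Under those assumptions the rate works out to roughly 41 percent a year, which means capacity plans built on a flat year-over-year volume would be outgrown quickly.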
Big Data Storage – big data storage systems like Hadoop provide a good means to store and organize large volumes of data, but processing that data to extract useful snippets of information is presently hard and tedious. The Map/Reduce architecture of these systems made it possible to store and process large amounts of data quickly and opened the door to many new data analytics opportunities. The data integration platform needs to build the structure for big data storage and map out its touch points with the other enterprise data assets.
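For readers unfamiliar with the Map/Reduce idea, the sketch below shows its three phases (map, shuffle, reduce) on a toy word count. It runs in a single Python process and does not use Hadoop or any of its APIs; the function names are illustrative only, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases in sequence over an iterable of text records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = map_reduce(["big data", "big analytics"])
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across many nodes, which is what makes the approach suitable for very large volumes.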
Data Discovery Platform – the data discovery platform is a set of tools and techniques that work on the big data file system to find patterns and answers to questions the business may have. Presently, this is mostly ad hoc work, and organizations still have difficulty putting a process around it. Many people compare data discovery to gold mining, except that in this case, by the time one finishes mining the gold, the silver has become more valuable. In other words, what is considered valuable information now may be history, and unusable, only a few hours later. The data integration architecture should accommodate this fast-paced data crunching while enforcing data quality and governance. As I detailed in my article, “Data Analytics Evolution at LinkedIn - Key Takeaways,” strategies such as LinkedIn’s “three second rule” can drive the data integration infrastructure to be responsive enough to meet end-user adoption needs. According to LinkedIn, repeated ad hoc requests are met systematically by developing a data discovery platform that offers a very high degree of reuse of the lessons learned.
Enterprise Data Warehouse – traditional data warehouses will continue to support core information needs, but will have to take on new features to integrate better with unstructured data sources and to meet the performance demands of the analytics platforms. Organizations have begun to develop new approaches to isolate operational analytics from deep analytics on history for strategic decisions. The data integration platform should be versatile enough to isolate operational information from the strategic, longer-term data assets. The data integration infrastructure also needs to be more sensitive to data "temperature," enabling quick access to the most widely and frequently accessed data.
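One possible reading of the requirement for quick access to the most frequently accessed data is to tier datasets by how often they are used. The sketch below is an illustration under that assumption, not a description of any particular warehouse product: the function name, the "hot" and "cold" labels, the 20 percent cut-off and the dataset names are all hypothetical.

```python
from collections import Counter


def assign_tiers(access_log, hot_fraction=0.2):
    """Label each dataset 'hot' or 'cold' based on access frequency.

    access_log   -- iterable of dataset names, one entry per access
    hot_fraction -- share of distinct datasets to keep on fast storage
                    (an assumed tuning knob, not an industry standard)
    """
    counts = Counter(access_log)
    # Most frequently accessed datasets first.
    ranked = [name for name, _ in counts.most_common()]
    # Keep at least one dataset hot whenever the log is non-empty.
    hot_count = max(1, round(len(ranked) * hot_fraction)) if ranked else 0
    return {
        name: ("hot" if position < hot_count else "cold")
        for position, name in enumerate(ranked)
    }


# Hypothetical access log: 'sales' is queried far more than the others.
log = ["sales"] * 5 + ["inventory"] * 2 + ["archive_2009"]
print(assign_tiers(log))
```

In practice the ranking would be recomputed periodically, since a dataset that is hot during quarter-end reporting may turn cold a few weeks later.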
Business Intelligence Portfolio – the business intelligence portfolio will continue to focus on past performance and results, even as demands for operational reporting and performance increase. The evolving needs of self-service BI and mobile BI will continue to pose architectural challenges to data integration platforms. Another critical aspect will be the BI portfolio’s ability to integrate with the data analytics portfolio. This need may further increase the demands on enterprise information integration.
Data Analytics Portfolio – there is a reason people who work in data analytics are called data scientists. The analytical work in this portfolio needs to deal with business problems as well as data problems, and data scientists need to work their way through building predictive models that add value to the organization. The data integration platform plays two roles in supporting the analytics portfolio. First, it should enable access to structured or unstructured data for analytics. Second, it should enable reuse of past analytics work, making the field more of an engineering activity than a science by reducing the number of scenarios that require reinventing the wheel.
In summary, the data integration ecosystem of the future will process very large volumes of data and will deal with very diverse demands, working with many varieties of data sources as well as a varied end-user base.
Raju is a data acquisition developer at Navy Federal Credit Union. He has over 20 years of diverse experience in project/program management, quality management, and data management. He holds many industry certifications, including CDMP, CBIP, CCP, PMP and CSQA. He can be reached at [email protected]