Data Integration Roadmap to Support Big Data and Analytics
Traditional extract, transform and load (ETL) has existed since data warehousing first evolved to help move data out of legacy mainframe applications. Its focus has therefore been data movement from files to relational or dimensional databases for consumption by reporting engines. Even today, when most attention goes to data visualization, analytics or business intelligence, data professionals recognize the importance of effective ETL engines as the backbone. However, with changes such as widespread data access points, diverse data sources and unstructured data, expectations of data connection interfaces have moved towards data integration rather than traditional data movement.
It is inspiring to read the new TDWI publication by David Loshin, "Satisfying New Requirements for Data Integration", a checklist report that briefly highlights the changing demands on data integration. The report lists seven demands: a) increase performance and efficiency, b) integrate the cloud, c) protect information in the integration layer, d) embed master data services, e) process big data and enterprise data, f) satisfy real-time demands, and g) develop data quality and data governance policies and practices.
While Loshin identifies the changing demands on legacy ETL platforms very well in this publication, it remains a presentation of a wished-for future state rather than of the path organizations can take to build a data integration framework that can sustain evolving needs. The following is a five-step roadmap with specific measures organizations can take as they move towards that future state.
Step 1: get the foundation strong
Establishing a strong data quality and governance organization is perhaps the first foundation needed for data organizations aspiring to transition from a mere data moving and storing entity to an information enablement engine. The data integration platform should enforce the policies and practices the organization establishes. ETL infrastructure gets the first look at the nature and volume of the data quality problems of source systems as they are integrated with the rest of the organization. The traditional approach in ETL has been to find workarounds that push the data through by making some tradeoffs. However, these chokepoints, also known as fault tolerance gates, should be re-examined so that the data quality and integrity problems they reveal are fed to the data governance organization. This does not mean that the organization cannot move to the next step until it resolves all of its data quality issues; it means establishing visibility and proper governance for processing the organization's data integrity issues.
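As an illustration of a fault tolerance gate that reports what it finds instead of silently working around it, the following is a minimal Python sketch. The field names (`source`, `customer_id`, `amount`), the two validation rules and the `QualityGate` class are all hypothetical; a real ETL tool would supply its own rule and rejection mechanisms.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class QualityGate:
    """Hypothetical fault tolerance gate that records everything it rejects."""
    issues: Counter = field(default_factory=Counter)
    quarantined: list = field(default_factory=list)

    def check(self, record: dict) -> bool:
        """Return True if the record passes; otherwise count and quarantine it."""
        problems = []
        if not record.get("customer_id"):
            problems.append("missing customer_id")
        amount = record.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append("invalid amount")
        if problems:
            source = record.get("source", "unknown")
            for problem in problems:
                self.issues[(source, problem)] += 1
            self.quarantined.append(record)
            return False
        return True


def load(records, gate):
    """Pass clean records through; rejected ones stay visible on the gate."""
    return [r for r in records if gate.check(r)]


def governance_report(gate):
    """Summarize issues per source system, most frequent first."""
    return [
        {"source": source, "issue": issue, "count": count}
        for (source, issue), count in gate.issues.most_common()
    ]
```

The load still proceeds with the clean records, so the interface is not blocked, but the counts per source system and the quarantined records give the governance organization something concrete to act on.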
Step 2: get serious about information security
Traditionally, ETL engines land sensitive data and do not always discard it from logs and temporary staging areas after use. Access, authorization and authentication are compromised when multiple people have the ability to use service accounts. Also, when production data is refreshed into test or development environments, scrubbing the data to de-sensitize it is often ignored. Information security, especially within the ETL world, needs very thorough audits and controls to ensure security policies are enforced. Without them, enabling a widespread data integration infrastructure can multiply these vulnerabilities and could conceivably be fatal to the organization.
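One way the scrubbing step for test and development refreshes might look is sketched below in Python. It is only an assumption-laden example: the set of sensitive field names and the choice of a keyed hash (HMAC-SHA256, truncated) are illustrative and are not prescribed by any particular security policy.

```python
import hashlib
import hmac

# Assumed field names; a real list would come from the security policy.
SENSITIVE_FIELDS = {"ssn", "email", "account_number"}


def mask_value(value: str, secret: bytes) -> str:
    """Replace a value with a keyed hash so equal inputs map to equal tokens."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def scrub_record(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    scrubbed = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and value is not None:
            scrubbed[key] = mask_value(str(value), secret)
        else:
            scrubbed[key] = value
    return scrubbed


def refresh_to_test(production_rows, secret: bytes):
    """Scrub every row before it leaves the production boundary."""
    return [scrub_record(row, secret) for row in production_rows]
```

Because the masking is deterministic for a given secret, joins across scrubbed tables still line up in the test environment. The secret itself would need the same protection as any other credential, and whether this form of masking is sufficient is a decision for the organization's security policy.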
Step 3: smarter master data and graceful validation services
Most master data management (MDM) implementations remain static and user managed. However, when used well, ETL infrastructure can implement an active and evolving master data management system. One of the first steps organizations are therefore taking is to integrate their MDM tools and methods with the ETL engines. ETL engines are also increasingly integrating with geospatial validation software or data mapping / translation engines to enforce data integrity. This enables the interfaces to be more graceful and not become chokepoints when dealing with bad data. There are always strong arguments about what ETL should or should not do to data. However, ETL’s traditional role of moving the data without touching it is being replaced with integrating data into the organizational information web. These steps can lay down the path for ETL engines as they form the organizational data integration architecture.
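A rough Python sketch of what an active master data lookup with graceful handling could look like follows. The `MasterData` class, the vendor alias table and the `needs_review` flag are hypothetical names introduced for illustration; commercial MDM tools expose their own interfaces.

```python
def normalize(name: str) -> str:
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


class MasterData:
    """Hypothetical master data lookup that also collects unmatched values."""

    def __init__(self, aliases: dict):
        self.aliases = {normalize(k): v for k, v in aliases.items()}
        self.candidates = {}  # unmatched normalized value -> times seen

    def resolve(self, raw: str):
        """Return the master ID, or None after recording the unmatched value."""
        key = normalize(raw)
        master_id = self.aliases.get(key)
        if master_id is None:
            self.candidates[key] = self.candidates.get(key, 0) + 1
        return master_id


def integrate(records, master):
    """Attach master IDs; flag unmatched records instead of failing the load."""
    integrated = []
    for record in records:
        enriched = dict(record)
        enriched["vendor_id"] = master.resolve(record["vendor"])
        enriched["needs_review"] = enriched["vendor_id"] is None
        integrated.append(enriched)
    return integrated
```

Unmatched values accumulate in `candidates`, which data stewards could review and promote into the alias table, so the master data evolves with the feeds rather than staying static.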
Step 4: upgrade the data integration infrastructure with the future in mind
When budgeting ETL infrastructure, most organizations use feedback mechanisms (what went wrong in the past) rather than feed-forward mechanisms (what needs to go right in the future). As a result, businesses often find themselves looking for shortcuts to meet their changing demands with unstructured infrastructure. Traditional ETL environments always lag behind, trying to catch up with the damage to data and process integrity caused by such short-sighted temporary investments. Therefore, a major part of transforming an ETL organization into a data integration organization involves strategic investment decisions on the fundamental infrastructure needs of the future establishment. For example, when integration with the cloud or real-time active data warehousing is on the horizon, the infrastructure investment decisions have to be taken now rather than at the last hour. This calls for program management thinking, not an infrastructure support mindset, while budgeting.
Step 5: enable expanded data integration
Organizations that have achieved progress in the previous steps can then think about how their data integration infrastructure can meet the needs of integration with the cloud, big data analytics, and mobile / self-service business intelligence. As Loshin explained in his article, mounds of structured data, unstructured data and big data, and advancements in cloud technology, coupled with end-user-driven needs such as mobile BI, self-service BI, real-time reporting and advanced visualization techniques, are rapidly expanding the need for data integration competence well beyond what traditional data movement ETL engines have to offer. At this stage, the data integration architecture has the necessary security framework, graceful validation to support unexpected behaviors in data feeds, the ability to integrate with and build organizational master data, and the required strategic programs in place to support the organizational enablement being demanded.
Traditional ETL infrastructure and processes need a clear roadmap that considers expected future demands rather than reacting to issues and challenges faced in the past. Building a data integration infrastructure that can support future business needs should be managed as a program with step-by-step evolution. Data integration infrastructure should support new data sources from the cloud, unstructured data and big data. It should also be able to support real-time needs for data, mobile business intelligence, information access and performance demands, information security needs, and analytics. The steps described in this article can provide vision for a roadmap as traditional ETL infrastructures transition to become data integration service providers.
Raju is a data acquisition developer at Navy Federal Credit Union. He has over 20 years of diverse experience in project/program management, quality management, and data management. He holds many industry certifications, including CDMP, CBIP, CCP, PMP and CSQA. He can be reached at [email protected]