Need for a Robust Data Quality Framework for Big Data

January 2, 2012

The challenges of data quality, and the corresponding accountability across business domains and research areas, have long been a concern. Among the key data quality problems are:

  • Non-interoperability – Data collected in one system are not electronically transmittable to other systems. Re-inputting the same data in multiple systems consumes resources and increases the potential for data-entry errors.
  • Non-standardized data definitions – Various data providers use different definitions for the same elements. Passed on to the district or state level, non-comparable data are aggregated inappropriately to produce inaccurate results.
  • Unavailability of data – Data required do not exist or are not readily accessible because of one quality issue or another. In some cases, data providers may take a “just fill something in” approach to satisfy distant data collectors, thus creating errors.
  • Inconsistent item response – Not all data providers report the same data elements. Idiosyncratic reporting of different types of information from different sources creates gaps and errors in macro-level data aggregation.
  • Inconsistency over time – The same data element is calculated, defined, and/or reported differently from year to year. Longitudinal inconsistency creates the potential for inaccurate analysis of trends over time.
  • Data entry errors – Inaccurate data are entered into a data collection instrument. Errors in reporting information can occur at any point in the process – from the student’s assessment answer sheet to the state’s report to the federal government.
  • Lack of timeliness – Data are reported too late. Late reporting can jeopardize the completeness of macro-level reporting.

We need a practical, readily implementable approach in which data quality rules can be defined just like other business rules: one that ensures proactive reporting of quality issues, checkpoints on newly inserted data, and so on.

Imagine if we had a framework that could enforce some of the following validation rules:

  1. Range Check – This checks that the data lies within a specified range of values
  2. Presence Check – This checks that the required data is not missing
  3. Domain Check – This checks that only certain values are accepted
  4. Cross-Field Check – This checks that multiple fields in combination are valid
  5. Cross-Table Check – This checks that multiple tables in combination are valid
  6. Uniqueness Validation – Ensure the values in a column are unique
  7. Reference Integrity Validation – Validate values between tables in relational database model
  8. Duplicate Identification – Identify a row as an unwanted duplicate record
  10. Format Validation – Constrain data values to a preset mask pattern
  11. Business Rule Compliance – Verify that data satisfies organization-specific business rules
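Several of these rules can be expressed as simple predicates over a record or a set of records. The sketch below, in Python, shows one minimal way such a framework could look; all function names, field names, and the sample record are illustrative assumptions, not part of any existing library.

```python
# Minimal sketch of rule-based data validation. Each check returns True/False;
# a real framework would also collect and report which rules failed.
import re


def range_check(record, field, lo, hi):
    """Range Check: value must lie within [lo, hi]."""
    v = record.get(field)
    return v is not None and lo <= v <= hi


def presence_check(record, field):
    """Presence Check: required field must not be missing or empty."""
    return record.get(field) not in (None, "")


def domain_check(record, field, allowed):
    """Domain Check: only certain values are accepted."""
    return record.get(field) in allowed


def cross_field_check(record, predicate):
    """Cross-Field Check: multiple fields in combination must be valid."""
    return predicate(record)


def uniqueness_check(records, field):
    """Uniqueness Validation: values in a column must be unique."""
    values = [r.get(field) for r in records]
    return len(values) == len(set(values))


def format_check(record, field, pattern):
    """Format Validation: value must match a preset mask pattern."""
    v = record.get(field)
    return v is not None and re.fullmatch(pattern, str(v)) is not None


# Example: validate one hypothetical student record against a few rules.
record = {"grade": 7, "state": "TX", "student_id": "S-00123"}
checks = [
    range_check(record, "grade", 1, 12),
    presence_check(record, "student_id"),
    domain_check(record, "state", {"TX", "CA", "NY"}),
    format_check(record, "student_id", r"S-\d{5}"),
]
print(all(checks))  # → True
```

Defining checks as small composable functions keeps the rule catalog extensible: cross-table and referential-integrity checks would follow the same pattern, taking two record sets instead of one.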