Information Theory Approach to Data Quality and MDM

June 2, 2010

Over the past decade, data quality has been a major focus for data management professionals, data governance organizations, and other data quality stakeholders across the enterprise. Still, data quality remains low in many organizations. To a considerable extent this is caused by the lack of scientifically, or at least consistently, defined data quality metrics. Data professionals still lack a common methodology that would enable them to measure data quality objectively in terms of scientifically defined metrics and to compare data sets by quality across systems, departments, and corporations.

Even though many data profiling metrics exist, their use is not scientifically justified. Consequently, enterprises and their departments apply their own standards or no standards at all.

As a result, regulatory agencies, executive management, and data governance organizations lack a standard, objective, and scientifically defined way to articulate data quality requirements and measure data quality improvement progress. Because data quality remains elusive, the job performance of the enterprise roles responsible for it lacks consistently defined criteria, which ultimately limits progress in data quality improvements.

A quantitative approach to data quality, if developed and adopted by the data management community, would enable data professionals to better prioritize data quality issues and take corrective action proactively and efficiently.

In this article we discuss a scientific approach to data quality for MDM based on Information Theory, which appears to be a good candidate for addressing the problems described above.

Approaches to Data Quality

At a high level there are two well-known and broadly used approaches to data quality. Typically both of them are used to a certain degree by every enterprise.

The first approach is mostly application driven and is often referred to as a “fit-for-purpose” approach. Business users determine that certain application queries or reports do not return the right data. For instance, if a query that is supposed to fetch the top 10 Q2 customers does not return some of the customers the business expects to see, in-depth data analysis follows. The analysis may determine that some customer records are duplicated and some transaction records have incorrect or missing transaction dates. Findings of this type can trigger activities aimed at understanding the data issues and taking corrective action.

An advantage of this approach is that it is aligned with the tactical needs of business functions, groups, and departments. A disadvantage is that it addresses data quality issues reactively, based on business requests or even complaints. Some data quality issues are not easy to discover, and business users may be unable to decide which report is right and which is wrong. The organization may eventually conclude that its data is bad but be unable to indicate what exactly needs to be fixed in the data, which limits IT’s ability to fix the issues. When multiple LOBs and functions across the enterprise struggle with their specific data quality issues separately, it is difficult to quantify the overall state of data quality and define the priorities with which data quality problems should be addressed by the enterprise.

The second approach is based on data profiling. Data profiling tools are intended to make data quality improvement more proactive and measurable. A number of data profiling metrics are typically introduced to screen for missing and invalid attributes, duplicate records, duplicate attribute values that are supposed to be unique, attribute frequencies, attribute cardinality and allowed values, standardization and validation of data formats for simple and complex attribute types, violations of referential integrity, etc. A limitation of data profiling techniques is that additional analysis is required to understand which of the metrics matter most to the business and why. It may not be easy to come up with a definitive answer and translate it into a data quality improvement action plan. The variety of data profiling metrics is not grounded in science but rather driven by the variety of ways relational database technology can report on data quality issues.
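As an illustration, several of these profiling metrics reduce to simple frequency counts. The sketch below is a minimal, hypothetical example; the record layout and function name are our own assumptions, not taken from any particular profiling tool:

```python
from collections import Counter

def profile_column(rows, col):
    """Compute a few common profiling metrics for one attribute.

    rows: list of dicts, one per record; col: attribute name.
    Returns the null rate, the distinct-value count (cardinality),
    and any duplicated values with their counts.
    """
    values = [r.get(col) for r in rows]
    n = len(values)
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "null_rate": 1 - len(non_null) / n if n else 0.0,
        "cardinality": len(counts),
        "duplicates": {v: c for v, c in counts.items() if c > 1},
    }

# Hypothetical sample: one missing value, one duplicated name.
rows = [
    {"customer": "Smith"},
    {"customer": "Smith"},
    {"customer": None},
    {"customer": "Jones"},
]
print(profile_column(rows, "customer"))
# {'null_rate': 0.25, 'cardinality': 2, 'duplicates': {'Smith': 2}}
```

Real profiling tools add many more screens (format validation, referential integrity, etc.), but the point stands: the metrics themselves are easy to compute; deciding which ones matter to the business is the hard part.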

Each of the two approaches above has its niche and significance. When the quality of master data is in question, an alternative and more strategic approach can be considered by data governance organizations. This approach avoids detailed analysis of business applications while providing a solid scientific foundation for its metrics.

Information Theory Approach to Data Quality for MDM  

Master data are those data that are foundational to business processes and usually widely distributed; when well managed, they contribute directly to the success of an organization, and when not well managed, they pose the most risk. Customer, Patient, Citizen, Member, Client, Broker, Product, Financial Instrument, and Drug are the entities often referred to as master data entities, while each company’s selection of master entities is driven by its own business and focus.

The primary function of a Master Data Service (MDS) is the creation of the “golden view” of the master entities. We will assume that the MDS has successfully created and maintains the “golden view” of entity F in the data hub. This “golden record” can be dynamic or persistent. A number of data sources across the enterprise contain data corresponding to domain F, including the source systems that feed the data hub and other data sources that may not be integrated with it. We define an external data set f whose data quality is to be quantified with respect to F. For the purpose of this discussion, f can represent any data set, such as a single data source or multiple sources.

Our goal is to compare the source data set f with the entity data set F. The data quality of f will be characterized by how well it represents the benchmark entity F, defined as the “golden view” of the data in domain F. We assume here that the “golden view” was created algorithmically and then validated by data stewards.

In Information Theory the information quantity associated with the entity F is expressed in terms of the entropy:

                    H(F) = – ∑ Pk log Pk,                                                                                            (1)   

where Pk are the probabilities of the attribute (token) values in the “golden” data set F. The index k runs over all records in F and all attributes. The base of the log function is 2.

H(F) represents the quantity of information in the “golden” representation of entity F.
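Equation (1) is straightforward to compute from value frequencies. The sketch below is illustrative only; the function name and sample values are our assumptions, not part of the original method:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (base 2) of a list of attribute/token values,
    per equation (1): H = -sum(P_k * log2(P_k))."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Four equiprobable surnames carry 2 bits of information;
# a column with a single repeated value carries none.
print(entropy(["Smith", "Jones", "Brown", "Davis"]))  # 2.0
print(entropy(["Smith", "Smith", "Smith"]))           # 0.0 (well, -0.0)
```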

Similarly, for the comparison data set f:

                    H(f) = – ∑ pi log pi,                                                                                            (2)   

We will use small “p” for the probabilities associated with f while capital letter “P” is used for the probabilities characterizing the “golden” entity record.

The mutual information J(f,F) (also called mutual entropy) characterizes how well f represents F:

                    J(f,F) = H(f) + H(F) – H(f,F)                                                                    (3)   

In (3), H(f,F) is the joint entropy of f and F. It is expressed in terms of the probabilities of combined events, e.g., the probability that name = “Smith” in the “golden record” F while name = “Schmidt” in the source record linked to the same entity. The behavior of J qualifies it as a good candidate for quantifying the data quality of f with respect to F. When data quality is low, the correlation between f and F is low. In the extreme case of very low data quality, f does not correlate with F at all and the two variables are independent. Then

                    H(f,F) = H(f) + H(F)                                                                                      (4)   

and

                    J(f,F) = 0                                                                                                       (5)   

If f represents F extremely well, e.g., f = F, then H(f) = H(F) = H(f,F) and

                    J(f,F) = H(F)                                                                                                  (6)   

We define Data Quality of f with respect to F by the following equation:

                    DQ(f,F) = J(f,F)/H(F)                                                                                      (7)   

With this definition, DQ ranges from 0 to 1. DQ = 0 indicates that the data quality of f is minimal: f does not represent F at all. DQ = 1 means that f represents F perfectly, i.e., the data quality of f with respect to F is 100%.
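Equations (3) and (7) can be sketched in a few lines, assuming the source and golden values have already been paired through entity linkage. The function names and sample data below are illustrative assumptions, not part of the original method:

```python
from collections import Counter
from math import log2

def entropy(values):
    """H = -sum(p * log2(p)) over the distinct values, per (1)-(2)."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def dq(source, golden):
    """DQ(f, F) = J(f, F) / H(F), per equations (3) and (7).

    source, golden: equal-length lists of attribute values for
    records already linked to the same entities, so that the joint
    entropy H(f, F) can be computed over the paired values.
    """
    h_f = entropy(source)
    h_F = entropy(golden)
    h_joint = entropy(list(zip(source, golden)))
    j = h_f + h_F - h_joint  # mutual information, equation (3)
    return j / h_F if h_F else 1.0

golden = ["Smith", "Jones", "Brown", "Davis"]
print(dq(golden, golden))                # 1.0 - a perfect copy
print(dq(["A", "B", "A", "B"], golden))  # 0.5 - source kept half the bits
```

The degraded source in the last line collapses four distinct entities into two values, so it carries 1 bit against the golden set’s 2 bits, and DQ = 0.5.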

The approach can also be used to determine data quality at the attribute or token level. This provides additional insight into what causes the most significant data quality issues.
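A possible sketch of attribute-level DQ, assuming records are simple dicts already linked by position; all names and sample data are hypothetical, and the entropy and DQ calculations are restated so the example is self-contained:

```python
from collections import Counter
from math import log2

def entropy(values):
    """H = -sum(p * log2(p)) over the distinct values, per (1)-(2)."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def attribute_dq(source_recs, golden_recs, attr):
    """DQ of a single attribute: J(f, F) / H(F) computed over that
    attribute's values in linked source and golden records."""
    f = [r[attr] for r in source_recs]
    F = [r[attr] for r in golden_recs]
    h_F = entropy(F)
    j = entropy(f) + h_F - entropy(list(zip(f, F)))
    return j / h_F if h_F else 1.0

golden = [{"name": "Smith", "city": "Boston"},
          {"name": "Jones", "city": "Denver"},
          {"name": "Brown", "city": "Austin"}]
source = [{"name": "Smith", "city": "Boston"},
          {"name": "Smith", "city": "Denver"},  # wrong name merges two entities
          {"name": "Brown", "city": "Austin"}]

for attr in ("name", "city"):
    print(attr, round(attribute_dq(source, golden, attr), 3))
# name 0.579
# city 1.0
```

Here the per-attribute breakdown immediately points to the name attribute as the source of information loss, while the city attribute is intact.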

Data quality improvement should be done iteratively. Changes in the source data may impact the “golden record”; equations (1) and (7) are then applied again to recalculate the data quantity and data quality characteristics.

Conclusion

The article offers an Information Theory based method for quantifying Information Assets and the Data Quality of the Assets through equations (1) and (7). The proposed method leverages the notion of a “golden record” created and maintained in the data hub. The “golden record” is used as the benchmark against which the data quality of other sources is measured.

Organizations can leverage this approach to augment their data governance offerings for MDM and differentiate their data governance practice. The quantitative approach to data quality will ultimately help data governance organizations develop policies based on scientifically defined data quality and quantity metrics.

By applying this approach consistently across a number of engagements, we will over time accumulate valuable insights into how metrics (1) and (7) relate to real-world data characteristics and scenarios. We will develop good practices defining acceptable data quality thresholds; for example, a future industry policy for the P&C insurance business might require keeping the quality of Customer data above the 92% mark, a clearly articulated data governance policy based on a scientifically sound approach to data quality metrics.

The developed approach can be incorporated into future products to enable data governance and provide data governance organizations with new tooling. They will be able to select the information sources and assets to be measured, quantify them according to (1) and (7), set target metrics for data stewards, measure progress on an ongoing basis, and report on data quality improvement.

Even though we mainly focus on data quality, the quantity of data in equation (1) also characterizes the overall significance of a corporate data set from the Information Theory perspective. For M&A, the method can be used to measure the additional amount of information that the joint enterprise will have compared to the information owned by the companies separately. The approach measures both the information acquired due to differences in the customer bases and the increment in information quantity due to better, more precise, and more useful information about existing customers.
