The Very True Fear of False Positives

July 18, 2009
105 Views

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g.

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels).  Ranking is often used as a primary method in differentiating the three possible result categories:

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review

All data matching techniques must also face the daunting challenge of what I refer to as The Two Headed Monster:

  • False Negatives – records that did not match, but should have been matched
  • False Positives – records that matched, but should not have been matched

For data examples that illustrate the challenge of false negatives and false positives, please refer to my Data Quality Pro articles:

 

Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution at least not until Doc Brown and I get our patent pending prototype of the redundant data capacitor working because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the Gartner Data Quality Magic Quadrant and many of the so-called niche vendors.  Without exception, I have always been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.

 

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives the identification of records within and across existing systems not currently linked that are preventing the enterprise from understanding the true data relationships that exist in their information assets.

However, the pursuit to reduce false negatives carries with it the risk of creating false positives. 

In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software, than they are about the false negatives not linked after all, those records were not linked before investing in the data matching software.  Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry’s dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern.  It is these human and business factors and not just the technology itself that need to be taken into consideration when planning a data matching implementation. 

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.

 

Additional Resources

Data Quality and Record Linkage Techniques

The Art of Data Matching

Identifying Duplicate Customer Records – Case Study

Narrative Fallacy and Data Matching

Speaking of Narrative Fallacy

The Myth of Matching: Why We Need Entity Resolution

The Human Element in Identity Resolution

Probabilistic Matching: Sounds like a good idea, but…

Probabilistic Matching: Part Two

Link to original post