The Very True Fear of False Positives

JimHarris

Data matching is commonly defined as the comparison of two or more records to evaluate whether they correspond to the same real-world entity (i.e., are duplicates) or represent some other data relationship (e.g., a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not-so-great news is that the wonderful world of data matching has a very weird way with words. Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, the Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g., numeric probabilities, weighted percentages, odds ratios, confidence levels). Ranking is often the primary method for differentiating the three possible result categories:

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review
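As a rough illustration of how a ranking can separate these three categories, here is a minimal Python sketch. It assumes the matching technique produces a score between 0 and 1; the function name `categorize` and the two cutoff values are hypothetical and are not taken from any particular product.

```python
from enum import Enum


class MatchCategory(Enum):
    AUTOMATIC_MATCH = "automatic match"
    AUTOMATIC_NON_MATCH = "automatic non-match"
    POTENTIAL_MATCH = "potential match (manual review)"


def categorize(score: float, lower: float = 0.60, upper: float = 0.90) -> MatchCategory:
    """Assign a ranked match result to one of the three categories.

    `score` is assumed to be a value between 0 and 1 produced by some
    matching technique; `lower` and `upper` are hypothetical cutoffs.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return MatchCategory.AUTOMATIC_MATCH
    if score < lower:
        return MatchCategory.AUTOMATIC_NON_MATCH
    # Everything between the two cutoffs goes to a person for review.
    return MatchCategory.POTENTIAL_MATCH


for score in (0.95, 0.75, 0.20):
    print(score, "->", categorize(score).value)
```

The wider the gap between the two cutoffs, the more results land in manual review; where to place them is a business decision as much as a technical one.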

All data matching techniques must also face the daunting challenge of what I refer to as The Two-Headed Monster:

  • False Negatives – records that did not match, but should have been matched
  • False Positives – records that matched, but should not have been matched
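The two definitions above can be stated as set differences between the pairs a matching process linked and the pairs that truly belong together. The sketch below assumes a known set of true pairs is available (in practice this usually comes from manual review of a sample); the record IDs and the function names are made up for illustration.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same record pair."""
    return {frozenset(pair) for pair in pairs}


def matching_errors(linked_pairs, true_pairs):
    """Return (false_positives, false_negatives) as sets of record pairs."""
    linked = normalize(linked_pairs)
    truth = normalize(true_pairs)
    false_positives = linked - truth   # matched, but should not have been
    false_negatives = truth - linked   # not matched, but should have been
    return false_positives, false_negatives


# Hypothetical record IDs
true_pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
linked_pairs = [("A2", "A1"), ("B1", "D1")]

fp, fn = matching_errors(linked_pairs, true_pairs)
print(len(fp), "false positive(s);", len(fn), "false negative(s)")
# prints: 1 false positive(s); 2 false negative(s)
```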

For data examples that illustrate the challenge of false negatives and false positives, please refer to my Data Quality Pro articles:

  • Identifying Duplicate Customers (Part 2): False Negatives
  • Identifying Duplicate Customers (Part 3): False Positives

 

Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution (at least not until Doc Brown and I get our patent-pending prototype of the redundant data capacitor working), because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique – every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the vendors in the Gartner Data Quality Magic Quadrant and from many of the so-called niche vendors. Without exception, I have been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.

 

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives: records within and across existing systems that are not currently linked, which prevents the enterprise from understanding the true data relationships that exist in its information assets.

However, the effort to reduce false negatives carries with it the risk of creating false positives.

In my experience, clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software than they are about the false negatives left unlinked. After all, those records were not linked before investing in the data matching software. Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.
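The trade-off described above can be made concrete with a small experiment. The sketch below assumes each candidate pair carries a score between 0 and 1 and, for the purpose of the illustration only, a known true label; the scores, labels, and cutoffs are invented, and manual review is left out to keep the example short.

```python
# (score, is_true_match) for hypothetical candidate pairs
scored_pairs = [
    (1.00, True),
    (0.97, True),
    (0.88, True),
    (0.85, False),
    (0.74, True),
    (0.70, False),
    (0.55, True),
    (0.30, False),
]


def error_counts(pairs, cutoff):
    """Count errors when every pair scoring >= cutoff is linked automatically."""
    false_positives = sum(
        1 for score, is_match in pairs if score >= cutoff and not is_match
    )
    false_negatives = sum(
        1 for score, is_match in pairs if score < cutoff and is_match
    )
    return false_positives, false_negatives


for cutoff in (1.00, 0.90, 0.80, 0.60):
    fp, fn = error_counts(scored_pairs, cutoff)
    print(f"cutoff {cutoff:.2f}: {fp} false positives, {fn} false negatives")
```

With this made-up data, the exact-match-only cutoff of 1.00 produces no false positives but leaves four of the five true matches unlinked, while the 0.60 cutoff leaves only one unlinked at the cost of two false positives. Real data will not be this tidy, but the direction of the trade-off is the same.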

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry’s dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.
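To show how little technology an exact (or near-exact) configuration actually requires, here is a minimal sketch that finds duplicates using nothing but lowercasing, whitespace cleanup, and a dictionary. The records, field names, and the function `exact_match_groups` are hypothetical.

```python
from collections import defaultdict


def exact_match_groups(records, key_fields):
    """Group records whose normalized key fields are identical."""
    groups = defaultdict(list)
    for record in records:
        # Normalize: lowercase and collapse runs of whitespace.
        key = tuple(" ".join(str(record[f]).lower().split()) for f in key_fields)
        groups[key].append(record["id"])
    # Only keys shared by more than one record are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical customer records
records = [
    {"id": 1, "name": "John Smith", "zip": "02134"},
    {"id": 2, "name": "john  smith", "zip": "02134"},
    {"id": 3, "name": "Jon Smith", "zip": "02134"},
]

print(exact_match_groups(records, ["name", "zip"]))
# prints: [[1, 2]]
```

Record 3 ("Jon Smith") is never linked, even though it may well be the same customer: no false positive is risked, and a possible false negative is quietly perpetuated.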

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern. These human and business factors, and not just the technology itself, need to be taken into consideration when planning a data matching implementation.

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.

 

Additional Resources

  • Data Quality and Record Linkage Techniques
  • The Art of Data Matching
  • Identifying Duplicate Customer Records – Case Study
  • Narrative Fallacy and Data Matching
  • Speaking of Narrative Fallacy
  • The Myth of Matching: Why We Need Entity Resolution
  • The Human Element in Identity Resolution
  • Probabilistic Matching: Sounds like a good idea, but…
  • Probabilistic Matching: Part Two
