Stop Mining Data!

December 14, 2011

In some recent planning and architectural discussions I’ve become aware of the significant difference between reasoning about data and reasoning about the world that the data represents.

Let me give you an example. If I tell you that we have two statements and that one (A) comes from a reliable source (with some statistical assessment of the reliability) and the other (B) from a less reliable source, you may wish to assert the first statement and drop the second.

In this case, you have used information about the data to make a decision about what you believe to be the case (whatever the proposition in statement A might have been).

Now let's imagine that the two statements are about phone numbers and their relationship with a business entity. There is a very simple test I can run to discover whether a phone number is operational. If I run that test, then I am thinking about the relationship between the data (the statement) and the world.
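To make that concrete, here is a minimal sketch of checking statements against the world instead of against their sources. Everything in it is an assumption for illustration: the `confirmed_by_world` helper, the `fake_probe` stand-in (a real probe might place a test call), and the phone numbers themselves.

```python
from typing import Callable, Iterable, List


def confirmed_by_world(
    phones: Iterable[str],
    is_operational: Callable[[str], bool],
) -> List[str]:
    """Return only the phone numbers that pass the real-world test."""
    return [p for p in phones if is_operational(p)]


# Hypothetical stand-in for a real probe; here the "world" is a fixed set.
LIVE_NUMBERS = {"555-0199"}


def fake_probe(phone: str) -> bool:
    return phone in LIVE_NUMBERS


# Numbers asserted by statement A (reliable source) and statement B (less reliable).
candidates = ["555-0100", "555-0199"]
confirmed = confirmed_by_world(candidates, fake_probe)
```

In this made-up case the number that survives is the one from the less reliable source, which is exactly the outcome that reasoning about the data alone would have missed.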

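The first type of reasoning is easy for a computer to do. We estimate a prior for a data source based on a sample of its statements and then apply that prior to the rest of the data. The second type is precisely what a human would do: check the claim against the world itself, for example by calling the number, rather than against the reputation of whoever made it.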
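The first kind of reasoning might look something like the sketch below. It assumes we have a small hand-checked sample of statements per source; the source names, the sample, and the helper functions are hypothetical, and add-one smoothing is just one reasonable way to estimate the prior.

```python
from collections import defaultdict


def estimate_reliability(sample):
    """Estimate per-source reliability from (source, is_correct) pairs.

    Uses add-one (Laplace) smoothing so that a source with few checked
    statements is not assigned a reliability of exactly 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])  # source -> [correct, total]
    for source, is_correct in sample:
        counts[source][0] += int(is_correct)
        counts[source][1] += 1
    return {s: (c + 1) / (n + 2) for s, (c, n) in counts.items()}


def pick_statement(statements, reliability):
    """Keep the statement whose source has the highest estimated reliability.

    Sources missing from the sample default to 0.5 (no evidence either way).
    """
    return max(statements, key=lambda st: reliability.get(st["source"], 0.5))


# Hypothetical hand-checked sample: 3 of 4 correct for feed_a, 1 of 4 for feed_b.
sample = [
    ("feed_a", True), ("feed_a", True), ("feed_a", True), ("feed_a", False),
    ("feed_b", True), ("feed_b", False), ("feed_b", False), ("feed_b", False),
]
reliability = estimate_reliability(sample)

statements = [
    {"source": "feed_a", "phone": "555-0100"},
    {"source": "feed_b", "phone": "555-0199"},
]
chosen = pick_statement(statements, reliability)
```

Note that nothing here ever touches the world: the decision rests entirely on what we know about the sources.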

When building record linkage systems, we should always keep in mind that the data is essentially a set of statements about the world, and that we are attempting to figure out, from the aggregate of those statements, what the world looks like.