Data Mining Fundamentals: Khabaza’s 9 Laws of Data Mining

Once, the term “data mining” was used only to describe a specific new culture of business people using innovative methods of uncovering useful patterns in data. The tools of the data miner were quite distinct from those of the traditionally trained analyst. When you heard “data mining,” you could be sure what it meant.

Today, the term “data mining” is thrown around very casually, and it may be used to describe anything from a business person using modern pattern recognition methods to a database analyst making SQL queries. As interest in data mining grew, lots of people started claiming, “Us, too! We’ve got data mining, too!” Much of this change has been driven by vendors who didn’t want to be left out of the data mining party, but didn’t want to invest in new tools and processes, either.

A lot of what passes for data mining today is no more than reporting – describing what has already happened, but not in a way that is revealing about what is likely to happen next.

It’s time to revive and embrace the true essence of data mining – enabling people with business knowledge to discover useful patterns in data, extracting information that they can use to understand, influence and improve business processes. The heart of the data mining philosophy has been expressed best by a pioneer of data mining, Tom Khabaza, in his now classic “9 Laws of Data Mining.”

Lately, I have seen some of these posted and discussed in bits and pieces, without reference to the man who stated them so succinctly and made them known in the data mining community. That’s a shame, because we have a lot to learn from reviewing the 9 Laws as a full set, and because Tom Khabaza is an innovator you ought to know. He was one of the earliest data miners and one of the developers of the Clementine data mining workbench. When you hear of widespread use of data mining in telecom and law enforcement today, know that it was Tom Khabaza who broke that ground first.

Here is an overview of Tom Khabaza’s “9 Laws of Data Mining.”

1st Law of Data Mining, or “Business Goals Law”: Business objectives are the origin of every data mining solution

We explore data to find information that helps us run the business better. Shouldn’t this be the mantra of all business data analysis?

It’s significant that this law comes first. Everyone should understand that data mining is a process with a purpose. Real miners don’t play in dirt; they follow a methodical process to uncover specific valuable material. Data miners also follow methodical processes in search of what’s valuable to them.

Quoting Tom Khabaza: “Data mining is not primarily the technology, it is the process, which has one or more business objectives at its heart. Without a business objective … there is no data mining.”

2nd Law of Data Mining, or “Business Knowledge Law”: Business knowledge is central to every step of the data mining process

There’s a horrible misconception floating around – that data mining doesn’t require the investigator to know anything. This is a misinterpretation of the true philosophy of data mining, that discovery of useful patterns in data can and should be put in the hands of business people who are not formally trained statisticians. Data mining is meant to bring power to the people – business people – who use their business knowledge, experience and insight, along with data mining methods, to find meaning in data.

3rd Law of Data Mining, or “Data Preparation Law”: Data preparation is more than half of every data mining process

This should come as no surprise to anyone with experience dealing with data, whether as a data miner, a traditional analyst, or in another role. However, this is another area where there is mythology surrounding data mining, implying that data mining overcomes all issues of data quality and completeness. This myth was propagated by some long-forgotten vendors of data mining products, but the data mining community is still working to set the record straight. Data mining calls for good data.

But there’s more to it than just needing good data. Manipulation of the data is an important part of the data miner’s process. Here’s how Tom Khabaza explains it:

“The reason is deeper than the state of the data: during data preparation, the data miner customizes the problem space.

There are two aspects to this “problem space shaping”. First the data miner must put the data in a suitable form for the algorithms to use at all – for many algorithms this means one row per example. Secondly, the data miner makes it easier for the algorithm to find a solution by enhancing the data with useful information or by putting the information into a helpful form. Examples include calculated fields, binning and calculating date and time differences.”
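To make those two aspects concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything from Tom Khabaza: the transaction records, the field names, the spend bands and the reference date are all hypothetical. It reshapes a transaction log into one row per customer, then adds a calculated field, a binned field and a date difference.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: several rows per customer.
transactions = [
    {"customer": "A", "date": date(2024, 1, 5), "amount": 40.0},
    {"customer": "A", "date": date(2024, 3, 1), "amount": 50.0},
    {"customer": "B", "date": date(2024, 2, 10), "amount": 500.0},
]

# Hypothetical reference date for the "days since last purchase" field.
AS_OF = date(2024, 4, 1)


def spend_band(total):
    """Binning: turn a continuous amount into a coarse category (cutoffs are assumptions)."""
    if total < 100:
        return "low"
    if total < 1000:
        return "medium"
    return "high"


def prepare(transactions, as_of):
    """Return one row per customer, enriched with derived fields."""
    grouped = defaultdict(list)
    for t in transactions:
        grouped[t["customer"]].append(t)

    rows = []
    for customer, items in sorted(grouped.items()):
        total = sum(t["amount"] for t in items)
        last_purchase = max(t["date"] for t in items)
        rows.append({
            "customer": customer,
            "n_purchases": len(items),
            "total_spend": total,
            "avg_spend": total / len(items),                   # calculated field
            "spend_band": spend_band(total),                   # binning
            "days_since_last": (as_of - last_purchase).days,   # date difference
        })
    return rows


rows = prepare(transactions, AS_OF)
for row in rows:
    print(row)
```

Notice that none of the derived fields exist in the raw log. Deciding that “days since last purchase” matters is a business judgment, which is why data preparation and business knowledge are so closely tied.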

4th Law of Data Mining, or “NFL-DM”: The right model for a given application can only be discovered by experiment

(NFL-DM = “There is No Free Lunch for the Data Miner”)

Here we could begin some very colorful discussion. At the end of this article, I’ll direct you to some places where you can read and participate in such discussions. For now, it is important that you simply understand that experimentation is central to data mining philosophy and practice.
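For readers who like to see an idea in code, here is a small Python sketch of what “discovered by experiment” can look like in practice. It is only an illustration under my own assumptions: the tiny churn data set, the two candidate models and the use of holdout accuracy as the yardstick are all hypothetical. The point is simply that each candidate is fitted on one set of cases and judged on another, and the choice follows from the result rather than from a prior belief about which method is best.

```python
# Hypothetical labelled examples: (monthly_usage_hours, churned), where 1 = churned.
train = [(5, 1), (8, 1), (12, 1), (30, 0), (42, 0), (55, 0), (18, 0), (9, 1)]
holdout = [(7, 1), (35, 0), (15, 1), (50, 0)]


def accuracy(model, data):
    """Share of examples where the model's prediction matches the known outcome."""
    return sum(model(x) == y for x, y in data) / len(data)


def fit_majority(data):
    """Candidate 1: always predict the most common outcome in the training data."""
    churned = sum(y for _, y in data)
    label = 1 if churned * 2 >= len(data) else 0
    return lambda x: label


def fit_threshold(data):
    """Candidate 2: predict churn when usage is at or below a cutoff chosen on the training data."""
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in data}):
        acc = accuracy(lambda x, c=cut: 1 if x <= c else 0, data)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return lambda x: 1 if x <= best_cut else 0


# The experiment: fit every candidate on the training cases, score on the holdout cases.
candidates = {"majority": fit_majority, "threshold": fit_threshold}
results = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best = max(results, key=results.get)

print(results)
print("Best on holdout:", best)
```

With different data, the ranking could easily reverse, and that is the whole lesson. Accuracy is used here only because it is simple; the 8th Law explains why it is rarely the final word.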

5th Law of Data Mining, or “Watkins’ Law”: There are always patterns

The practical experience of data miners is that useful patterns are consistently found when data is explored.

[The “Watkins” mentioned here refers to David Watkins, also a well-known data miner and one of the developers of Clementine.]

6th Law of Data Mining: Data mining amplifies perception in the business domain

This law speaks to the benefits of data mining algorithms and processes – they bring to light patterns in the data that would otherwise have gone undiscovered.

7th Law of Data Mining, or “Prediction Law”: Prediction increases information locally by generalization

This is the law that I have found most challenging to clarify in my own mind, but here goes:

Data mining offers us ways to look at a case whose outcome is unknown, and find similarities to past cases where the outcome is known. By understanding those similarities, we gain information about likely outcomes for new cases.
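One simple way to picture this is a nearest-neighbor vote, sketched below in Python. This is my own illustration of the idea of predicting from similar past cases, not a method the law prescribes. The customer records, the two features and the choice of three neighbors are all hypothetical, and the features are left unscaled to keep the sketch short.

```python
import math
from collections import Counter

# Hypothetical past cases: ((tenure_months, support_calls), known outcome).
past_cases = [
    ((2, 5), "left"),
    ((3, 4), "left"),
    ((4, 6), "left"),
    ((24, 1), "stayed"),
    ((30, 0), "stayed"),
    ((36, 2), "stayed"),
]


def predict(new_case, past_cases, k=3):
    """Find the k most similar past cases and return their most common outcome."""
    nearest = sorted(past_cases, key=lambda case: math.dist(case[0], new_case))[:k]
    votes = Counter(outcome for _, outcome in nearest)
    return votes.most_common(1)[0][0]


# Two new cases whose outcomes are unknown.
print(predict((5, 5), past_cases))
print(predict((28, 1), past_cases))
```

Nothing new is known about the world as a whole after this runs. What has increased is the information about one particular case, and it came from generalizing across cases that resemble it.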

8th Law of Data Mining, or “Value Law”: The value of data mining results is not determined by the accuracy or stability of predictive models

The real value of the process is in filling a business need. Accuracy and stability in a model are good, of course, but may be less important than issues such as the importance of predicted values to a business, meaningful insights, or the ease of putting the predictions to use.
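A little arithmetic shows how accuracy and value can pull apart. The Python sketch below compares two hypothetical churn models scored on the same 1,000 customers. All of the numbers are invented for illustration, as are the assumptions that each correctly flagged churner who is contacted is retained and worth 200, and that each contact costs 10.

```python
# Hypothetical results for two models on the same 1,000 customers.
# tp = churners flagged, fp = loyal customers flagged,
# fn = churners missed,  tn = loyal customers left alone.
model_a = {"tp": 10, "fp": 5, "fn": 40, "tn": 945}
model_b = {"tp": 40, "fp": 160, "fn": 10, "tn": 790}


def accuracy(m):
    """Share of all customers classified correctly."""
    return (m["tp"] + m["tn"]) / sum(m.values())


def campaign_profit(m, value_per_save=200, cost_per_contact=10):
    """Profit from contacting every flagged customer (payoff figures are assumptions)."""
    contacted = m["tp"] + m["fp"]
    return m["tp"] * value_per_save - contacted * cost_per_contact


for name, m in [("A", model_a), ("B", model_b)]:
    print(name, "accuracy:", accuracy(m), "profit:", campaign_profit(m))
```

Under these assumptions, model A is the more accurate of the two, yet model B earns more than three times the profit because it finds far more of the customers who matter. Change the payoff figures and the answer changes with them, which is exactly why value has to be judged in business terms.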

9th Law of Data Mining, or “Law of Change”: All patterns are subject to change

A model that has great business value today may be just another old model tomorrow. Business does not sit still, and neither can data miners.

There you have them – maxims that sum up the current thinking as well as some origins of data mining.

Would you like to learn more? Ask questions? Challenge some points? Here are some of the best places:

9 Laws of Data Mining, a LinkedIn group http://linkd.in/9lawsli

Tom Khabaza’s website http://www.khabaza.com/

…and, of course, your comments are welcome here.