Essential Elements of Data Mining

This is my attempt to clarify what Data Mining is and what it isn’t. According to Wikipedia, “In philosophy, essentialism is the view that, for any specific kind of entity, there is a set of characteristics or properties all of which any entity of that kind must possess.” I do not seek the Platonic form of Data Mining, but I do seek clarity where it is often lacking. There is much confusion surrounding how Data Mining is distinct from related areas like Statistics and Business Intelligence. My primary goal is to clarify the characteristics that a project must have to be a Data Mining project. By implication, Statistical Analysis (hypothesis testing), Business Intelligence reporting, Exploratory Data Analysis, etc., do not have all of these defining properties. They are highly valuable, but have their own unique characteristics. I have come up with ten. It is quite appropriate to emphasize the first and the last. They are the bookends of the list, and they capture the heart of the matter.

1) A Question
2) History
3) A Flat File
4) Computers
5) Knowledge of the Domain
6) A Lot of Time
7) Nothing to Prove
8) Proof that you are Right
9) Surprise
10) Something to Gain

1) A Question: Data Mining is not an unfocused search for anything interesting. It is a method for answering a specific question, meeting a particular need. Getting new customers is not the same as keeping the customers you already have. Of course, they are similar, but different in both big and subtle ways. The bottom line is that every decision that you make about the data that you select and assemble flows from the business question.

2) History: Data Mining is not primarily about the present tense, which distinguishes it from Business Intelligence reporting. It is about using the past to predict the future. How far into the past? Well, if your customers sign a 12-month contract, then the data you need is probably more than 12 months old. It must be old enough to include a cohort of customers who have started and finished a process that is still ongoing for others. Did they renew? Did they churn? You need a group of records for which the outcome of the process is known historically. This outcome status is usually in the form of a Target or Dependent Variable. It is the cornerstone of the data set that one must create, and is the key to virtually all Data Mining projects.
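
To make the idea of a cohort with a known outcome concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the record layout, the names `build_target` and `add_months`, the 12-month term, and the `churned` flag. It is one possible way to derive a Target, not a prescribed method.

```python
from datetime import date

CONTRACT_MONTHS = 12  # assumed contract length, taken from the example in the text


def add_months(d, months):
    """Return the date `months` months after `d` (day clamped to 28 to keep it simple)."""
    total = d.month - 1 + months
    return date(d.year + total // 12, total % 12 + 1, min(d.day, 28))


def build_target(contracts, as_of):
    """Keep only customers whose contract term has ended and label the outcome."""
    rows = []
    for customer_id, start, renewed in contracts:
        term_end = add_months(start, CONTRACT_MONTHS)
        if term_end > as_of:
            continue  # outcome not yet known, so this record cannot carry a Target
        rows.append({"customer_id": customer_id, "churned": 0 if renewed else 1})
    return rows


# Hypothetical records: (customer_id, contract_start, renewed)
contracts = [
    ("A", date(2022, 1, 15), True),
    ("B", date(2022, 3, 1), False),
    ("C", date(2023, 9, 10), None),  # still mid-contract, outcome unknown
]

labeled = build_target(contracts, as_of=date(2024, 1, 1))
```

Customer C is dropped because the 12-month term has not finished as of the cutoff date, which is exactly why the history has to reach back further than the length of the process being modeled.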

3) A Flat File: Data Miners are not in the Dark Ages. They work with relational databases on a daily basis, but the algorithms that are used are designed to run on flat files. Software vendors are proud to tout “in-database modeling,” and it is exciting for its speed, but you still have to build a flat file that has all of your records and characteristics in one table. The Data Miner and author Gordon Linoff calls this a “customer signature.” I rather prefer the idea of a customer “footprint,” as it always involves an accumulation of facts over time. The resulting flat file will be unique to the project, specifically built to allow the particular questions of the Data Mining project to be answered.
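
As a rough illustration of what building such a file involves, the sketch below rolls several transaction rows per customer up into one row per customer. The field names (`amount`, `month`, `n_purchases`, and so on), the `build_signature` helper, and the sample data are all hypothetical; a real signature would be shaped by the business question.

```python
from collections import defaultdict

# Hypothetical transaction rows, as they might come out of a relational table
transactions = [
    {"customer_id": "A", "month": "2023-01", "amount": 40.0},
    {"customer_id": "A", "month": "2023-02", "amount": 55.0},
    {"customer_id": "B", "month": "2023-01", "amount": 20.0},
]

# Hypothetical historical outcomes (the Target), keyed by customer
outcomes = {"A": 0, "B": 1}


def build_signature(transactions, outcomes):
    """Accumulate many rows per customer into one flat row per customer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    last_month = {}
    for t in transactions:
        cid = t["customer_id"]
        totals[cid] += t["amount"]
        counts[cid] += 1
        # ISO-style "YYYY-MM" strings sort chronologically
        last_month[cid] = max(last_month.get(cid, ""), t["month"])
    return [
        {
            "customer_id": cid,
            "n_purchases": counts[cid],
            "total_spend": totals[cid],
            "avg_spend": totals[cid] / counts[cid],
            "last_active": last_month[cid],
            "churned": outcomes[cid],
        }
        for cid in sorted(totals)
        if cid in outcomes  # only customers with a known outcome
    ]


flat_file = build_signature(transactions, outcomes)
```

Each customer ends up as exactly one row, with the accumulated facts as columns and the Target alongside them, which is the shape most modeling algorithms expect.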

4) Computers: Data Mining data sets are not always huge. Sometimes they are in the low thousands of records, and sometimes a carefully selected sample of a few percent of your data is plenty to find patterns. So, despite all the talk of Big Data, the size of the data file is not really a limiting factor on today’s machines. Statistics software packages were capable of running a plain vanilla regression on larger data sets decades ago. The real thing that separates Data Mining from R. A. Fisher and his barley data set is that Data Mining algorithms are highly iterative. Considerable computing power is needed to find the best predictors and try them in all possible combinations using a myriad of different strategies. Data Mining is not simply Statistics on Big Data. Data Mining algorithms were created in a post-computing environment to solve post-computing problems. They are qualitatively different from traditional statistical techniques in fundamental and important ways, and even when traditional techniques are used, they are used in the service of substantively different purposes.
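
A small back-of-the-envelope sketch shows why the number of rows is not the real burden. It simply enumerates every non-empty combination of a handful of predictors; the predictor names and the `candidate_subsets` helper are made up for illustration. Real algorithms generally rely on heuristics rather than exhaustive enumeration, so treat this as a picture of the size of the search space, not of how any particular tool works.

```python
from itertools import combinations


def candidate_subsets(predictors):
    """Yield every non-empty combination of the given predictors."""
    for k in range(1, len(predictors) + 1):
        yield from combinations(predictors, k)


# Hypothetical predictor names
predictors = ["tenure", "monthly_spend", "support_calls", "region"]
subsets = list(candidate_subsets(predictors))

# With p predictors there are 2**p - 1 non-empty subsets.
n_for_4 = len(subsets)    # 15
n_for_30 = 2 ** 30 - 1    # 1073741823
```

Four predictors give 15 candidate combinations; thirty give over a billion, before any interactions or alternative modeling strategies are considered.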

5) Knowledge of the Domain: A sales rep once told me a story, probably apocryphal, about the early days of the Data Mining software I use. A banking client wanted to put them to the test, so the client said: “Here are some unlabeled variables. We are going to keep the meaning of them secret. Tell us which are the best predictors of variable X. If you answer ‘correctly’, we will buy.” What a horrible idea! The Data Mining algorithms play an important role in guiding the model building process, but only the human partner in the process can be the final arbiter of what best meets the need of the business problem. There must be business context, and if the nature of the data requires it, that context might involve Doctors, Engineers, Call Center Managers, Insurance Auditors or a host of other specialists.

6) A Lot of Time: Data Mining projects take time, a lot of time. They take many weeks, and perhaps quite a few months. If someone asks a Data Miner if they can have something preliminary in a week, they are thinking about something other than Data Mining. Maybe they really mean generating a report, but they don’t mean Data Mining. Problem definition takes time because it involves a lot of people, assembled together, hashing out priorities, figuring out who is in charge of what. With this collaboration, the project lead can’t easily make up lost time by burning the midnight oil. Data Preparation takes much of the time. Perhaps you assume that you will be Mining the unaltered contents of your Data Warehouse. It was created to support BI Reporting, not to support Data Mining, so that is not going to happen. Finally, when you’ve got something interesting, you have to reconvene a lot of people again, and you aren’t done until you have deployed something, making it part of the decision management engines of the business. (See Element 10.)

7) Nothing to Prove: If you are verifying an outcome, certain that you are right, having carefully chosen predictors in advance, simply curious how well it fits, you aren’t doing Data Mining. Perhaps you are merely exploring the data in advance, biding your time, waiting until your deadline approaches and then using hypothesis testing to congratulate yourself on how successfully your model fits data that you have already explored. This is, of course, the worst possible combination of Statistics and Data Mining imaginable, and violates the most basic assumptions of hypothesis testing. Neither of these approaches is Data Mining.

8) Proof that you are Right: Data Mining, by its very nature, does not have a priori hypotheses, but it does need proof. A contradiction? The most fundamental requirement of Data Mining is that the same data which was used to uncover the pattern must never be used to prove that the pattern applies to future data. The standard way of doing this is to divide one’s data randomly into two portions, building the model on the Train data set and verifying it on the Test data set. In this is found the essence of Data Mining, because it gives one freedom to explore the Train data set, uncovering its mysteries, while awaiting the eventual judgment of the Test data set.
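
The random division described above can be sketched in a few lines of Python. The function name `train_test_split`, the 30 percent Test fraction, and the fixed seed are assumptions chosen for the example; the right proportions depend on the project.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly divide records into a Train portion and a Test portion."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 100 rows of the flat file
records = list(range(100))
train, test = train_test_split(records)

# Explore and build the model on `train` only.
# Touch `test` once, at the end, to judge the finished model.
```

Every record lands in exactly one of the two portions, so nothing used to uncover a pattern is later reused to confirm it.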

9) Surprise: A common mistake in Data Mining is being too frugal with predictors, leaving out this or that variable because “everyone knows” that it is not a key driver. Not wise. Even if this is true, it discounts the insight that an unanticipated interaction might provide. Even if true, it is a needless precaution because Data Mining algorithms are designed to be resilient to large numbers of related predictors. This is not to say that feature selection is unimportant (it is a key skill), but rather that Data Miners must be cautious when removing variables. Each of those variables costs the business money to record, and the insights they might offer have monetary value as well. Doing variable reduction well in Data Mining is in striking contrast with doing variable reduction well in Statistics.

10) Something to Gain: It might be somewhat controversial, but I think not overly so, to establish an equivalence: Data Mining Equals Deployment. Without deployment, you may have done something valuable, perhaps even accompanied by demonstrable ROI, but you have fallen short. You may have reached a milestone. You may even have met the specific requirements of your assignment, but it isn’t really Data Mining until it is deployed. The whole idea of Data Mining is taking a carefully crafted snapshot, a chunk of history, establishing a set of Best Practices, and inserting them in the flow of Decision Making of the business.

The issue of clarifying what Data Mining is (and what to call it) comes up often in conversation among Data Miners, so I hope the community of data analysts will find this a worthy enterprise. I intend to present this list to new Data Miners when I meet them in a tool-neutral setting. Please do provide your feedback. Would you add to the list? Do you think that there are any properties listed here that are not required to call a project Data Mining?