**Myth or Fact: The Diminishing Marginal Returns of Variable Creation in Data Mining Solutions**

Data mining practitioners will tell you that much of the real value of their work lies in the ability to derive new variables from the source information within a database or file. For example, averages or totals calculated over a specific timeframe are derived within the database rather than extracted directly from source information. Another good example is a postal area variable based on the first character of the postal code. Although these examples seem simple, it is not uncommon for derived variables to comprise over 90% of the information within an analytical exercise. From the data analyst's perspective, there are no limits to the number of derived variables that can be created; the only constraint is the imagination of the analyst or practitioner. In theory, creating variables and adding new information should provide incremental benefit to any data mining solution. But as with any exercise or project, there are diminishing returns as one begins to explore the many possibilities and permutations that exist within variable creation. In this article, we address this issue not in an academic manner but in the practical sense of whether it provides business benefit to a given data mining solution. Specifically, we explore the impact of exploding the number of variables versus our traditional techniques of variable creation.

**The Traditional Techniques of Variable Creation**

Our traditional techniques of variable creation attempt to yield insights by looking at creating variables in the following fashion:

- Binary Variables (a yes/no outcome which represents the occurrence of some activity or event)
- Average/Median or Sum of a Variable
- Index/Ordinal Variables, whereby variable outcomes and values are placed into ranked groups. For example, age might be grouped into three outcomes, with 1 being 30 years or under, 2 being 31-50, and 3 being 51+.
- Change/Velocity Variables, whereby variables are created that look at how behaviour has changed over time.
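As a minimal sketch of the first and third items in this list, the Python below derives a binary flag and an ordinal age band. The field names and the three sample customers are hypothetical, and placing age 30 in the first band is an assumption made for illustration.

```python
def age_band(age):
    """Ordinal variable: 1 = 30 or under, 2 = 31-50, 3 = 51+ (assumed cut-offs)."""
    if age <= 30:
        return 1
    if age <= 50:
        return 2
    return 3


def purchased_flag(purchase_count):
    """Binary variable: 1 if the activity occurred at least once, else 0."""
    return 1 if purchase_count > 0 else 0


# Hypothetical source records, invented for illustration only.
customers = [
    {"id": "A", "age": 27, "purchases": 0},
    {"id": "B", "age": 44, "purchases": 3},
    {"id": "C", "age": 63, "purchases": 1},
]

derived = [
    {
        "id": c["id"],
        "age_band": age_band(c["age"]),
        "purchased_flag": purchased_flag(c["purchases"]),
    }
    for c in customers
]
print(derived)
```

Each derived field is a simple function of one source field, which is why an analyst can produce these variables by the hundreds.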

Across these four areas, it is not unusual to discover that the analyst has created hundreds of variables for a given analytical exercise. But are there other transformations that we should consider when building solutions? The rationale for considering a whole new suite of additional variables is that this new information might provide incremental benefit to a given solution. In our research, we started with predictive models that were built using the traditional approach to variable creation described above.

**Exploding the Number of Variables**

We then looked at additional approaches which would explode the number of variables in our analytical file. Further mathematical transformations were employed which consisted of the following:

- Log Transformation
- Square Root Transformation
- Sine Transformation
- Cosine Transformation
- Tangent Transformation

We also looked at combining pairs of top variables that were significant within the correlation analysis of the given predictive model. A good example of this would be age and gender, where we can capture the effect of the two together and observe their joint impact on the modeled behaviour. In the variable pair routines, we attempted to look at all possible combinations. If there are 20 variables that are derived using the traditional approach, then the number of possible variable pairs is (20 × 19) / 2 = 190.

Both of these transformations (mathematical and pairs) dramatically increase the number of variables. The mathematical transformations add 100 variables (5 × 20), and the variable pair transformations add another 190 (as calculated above). In this simple exercise, 20 variables created using the traditional approach explode to 310 (20 + 100 + 190).
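The arithmetic above can be checked with a short Python sketch. The variable names are hypothetical, the values are assumed to be strictly positive so that the log and square root are defined, and representing a pair as the product of its two variables is an assumption, since the pairing method is not specified here.

```python
import math
from itertools import combinations

# Hypothetical names for the 20 traditionally derived variables.
BASE_NAMES = [f"var_{i:02d}" for i in range(1, 21)]

# The five mathematical transformations listed above.
TRANSFORMS = {
    "log": math.log,
    "sqrt": math.sqrt,
    "sin": math.sin,
    "cos": math.cos,
    "tan": math.tan,
}


def explode(record):
    """Return the record plus its transformed and paired variables."""
    out = dict(record)
    for name, value in record.items():
        for t_name, fn in TRANSFORMS.items():
            out[f"{t_name}_{name}"] = fn(value)
    # Assumption: a pair is represented by the product of the two variables.
    for a, b in combinations(record, 2):
        out[f"{a}_x_{b}"] = record[a] * record[b]
    return out


# Toy record with strictly positive values (1.0 to 20.0).
record = {name: float(i) for i, name in enumerate(BASE_NAMES, start=1)}
exploded = explode(record)

n_base = len(record)
n_math = n_base * len(TRANSFORMS)
n_pairs = math.comb(n_base, 2)
print(n_base, n_math, n_pairs, len(exploded))  # 20 100 190 310
```

The pair count grows with the square of the number of base variables, so the explosion gets much larger as the starting set grows.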

**Business Cases to Demonstrate the Point**

The challenge, though, in this type of exercise is to determine whether there is a real business benefit in exploding the number of variables from 20 to 310. Our approach to identifying this benefit was to compare models developed before this variable creation explosion with models developed after it. We looked at four models built by our organization:

- Acquisition Model
- Up-Sell Model
- Cross-Sell Model
- Insurance Loss Model

For each of these models, we created lift/gains charts using just the 20 variables and compared them with lift/gains charts built using the expanded set of 310 variables. In all four cases, we found that the more exhaustive approach to variable creation produced no real incremental lift.
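For readers less familiar with lift charts, the following Python sketch shows one common way to compute lift by score band: rank records by model score, split them into equal-sized bands, and divide each band's response rate by the overall response rate. The scores and outcomes are toy values invented for illustration and are not results from the four models.

```python
def band_lift(scores, outcomes, bins=10):
    """Lift per score band, highest-scoring band first.

    Assumes at least `bins` records and at least one positive outcome.
    """
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall_rate = sum(outcomes) / n
    lifts = []
    for b in range(bins):
        chunk = ranked[b * n // bins:(b + 1) * n // bins]
        band_rate = sum(outcome for _, outcome in chunk) / len(chunk)
        lifts.append(band_rate / overall_rate)
    return lifts


# Toy data: 20 records with descending scores and 0/1 responses.
scores = [i / 20 for i in range(20, 0, -1)]
outcomes = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

lifts = band_lift(scores, outcomes, bins=5)
print([round(x, 2) for x in lifts])  # [2.5, 1.67, 0.83, 0.0, 0.0]
```

Running the same function on the scores from the 20-variable model and the 310-variable model gives the two lift curves to compare band by band.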

But it is not unreasonable to want a deeper understanding of, or some rationale for, why these results occur. Let’s try to address this by looking at variable creation in each of its stages.

**Considering Demographic Variables**

The first stage represents the source information, or variables that are not manipulated by the analyst. Good examples of this are tenure, income, household size, etc. This information is extremely useful as it captures the basic demographics of the individual.

The subsequent stages, as outlined below, all comprise derived variables, which in most cases come from a company's purchase/billing systems. For now, we are going to focus on stages that deal with the following:

- Grouping of variables
- Calculating basic arithmetic diagnostics (total, mean, median, standard deviation, min and max)
- Calculating change variables

**Grouping of Variable Values**

This stage looks at variables that represent the grouping of values into categories. A good example of this is postal code, where postal codes are grouped into regions (e.g. all postal codes beginning with the first character ‘M’ represent Toronto postal codes). Other examples might represent the grouping of specific product types or service codes into much broader product or service categories.
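A grouping variable of this kind can be sketched in a few lines of Python. Only the ‘M’ to Toronto mapping comes from the example above; the other entries and the function name are illustrative assumptions, and a real lookup table would cover every leading character.

```python
# Partial, illustrative lookup keyed on the first character of the postal code.
REGION_BY_FIRST_CHAR = {
    "M": "Toronto",
    "H": "Montreal",
    "V": "British Columbia",
}


def postal_region(postal_code):
    """Group a postal code into a broad region using its first character."""
    code = postal_code.strip().upper()
    if not code:
        return "Unknown"
    return REGION_BY_FIRST_CHAR.get(code[0], "Other")


print(postal_region("m5v 2t6"))  # Toronto
print(postal_region("T2P 1J9"))  # Other
```

The same pattern applies to rolling up product or service codes into broader categories: a lookup table plus a default bucket for unmapped values.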

**Calculating basic arithmetic diagnostics (total, mean, median, standard deviation, min and max)**

This stage deals with the ability to summarize numerical information into a meaningful metric. However, it is only meaningful if we have historical information that can be used to calculate this kind of diagnostic. For example, if we are calculating the average spend or the variation in spend, the question we must address is what timeframe we are looking at. Is the average or variation based on six months, 12 months, etc.?
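To make the timeframe question concrete, here is a minimal Python sketch that computes the same diagnostics over a 6-month and a 12-month window. The monthly spend figures are invented, the function name is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, median, pstdev


def spend_summary(monthly_spend, months):
    """Summarize the most recent `months` values (list ordered oldest first).

    Assumes 1 <= months <= len(monthly_spend).
    """
    window = monthly_spend[-months:]
    return {
        "total": sum(window),
        "mean": mean(window),
        "median": median(window),
        "std": pstdev(window),  # population standard deviation (assumed choice)
        "min": min(window),
        "max": max(window),
    }


# Toy history of 12 monthly spend amounts, oldest first.
monthly_spend = [40, 55, 0, 70, 65, 80, 20, 35, 90, 60, 75, 50]

last_6 = spend_summary(monthly_spend, 6)
last_12 = spend_summary(monthly_spend, 12)
print(last_6["total"], last_6["mean"])  # 330 55
print(last_12["total"], last_12["min"], last_12["max"])  # 640 0 90
```

Each choice of window produces a separate set of derived variables, which is one reason the variable count grows so quickly even under the traditional approach.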

**Calculating change variables**

Extending this logic of using historical information, we may want to identify how summarized behaviour changes over time. Has spending or product purchase behaviour changed over a period of time? Has it changed drastically over the last 3 months, over the last 6 months, etc.?
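One way to express such a change variable, offered here as a sketch rather than a prescribed formula, is the relative change in average spend between the most recent window and the window immediately before it. The data and function name are hypothetical.

```python
def spend_change(monthly_spend, months):
    """Relative change in average spend: recent window vs. the prior window.

    `monthly_spend` is ordered oldest first. Returns None when the prior
    average is zero, since the ratio is undefined in that case.
    """
    if months < 1 or len(monthly_spend) < 2 * months:
        raise ValueError("need at least 2 * months of history")
    recent = monthly_spend[-months:]
    prior = monthly_spend[-2 * months:-months]
    recent_avg = sum(recent) / months
    prior_avg = sum(prior) / months
    if prior_avg == 0:
        return None
    return (recent_avg - prior_avg) / prior_avg


# Toy history of 12 monthly spend amounts, oldest first.
monthly_spend = [40, 55, 0, 70, 65, 80, 20, 35, 90, 60, 75, 50]

change_3m = spend_change(monthly_spend, 3)  # last 3 months vs. the 3 before
change_6m = spend_change(monthly_spend, 6)  # last 6 months vs. the 6 before
print(round(change_3m, 3), round(change_6m, 3))
```

A positive value indicates rising spend and a negative value indicates decline; comparing the 3-month and 6-month figures hints at whether a shift is recent or sustained.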

**Putting some perspective on the Traditional Approach to Variable Creation**

In all these stages, key information is being produced that is unique in explaining the desired modeled behaviour. Let’s explore this thinking in more detail. The source-level information in many cases yields demographic information, such as the age or tenure of a customer, which, as we all know, has represented key variables within a given model at one time or another. Grouping values such as postal codes into regions can provide more meaningful insight when attempting to look at geography in a broader manner. Arithmetic diagnostics look at how summarized behaviour regarding key metrics can add value to a desired modeled behaviour. We have all seen examples of summarized metrics, such as total products purchased or average spend, as key variables within models. The diagnostic type variables differ from source variables in that source variables look at information at a point in time, while diagnostics look at information over a period of time. Meanwhile, the change type variables add another dimension in that we are looking at how this summarized behaviour changes over time. Each of these stages (demographics and other point-in-time variables from the source data, summarized data from calculating basic diagnostics, and change variables from the summarized data) represents a unique way of looking at the information. Because of these different perspectives, models will typically incorporate variables from each of them.

**Extensive Variable Creation**

Having offered some rationale for why the traditional approach to variable creation adds significant value to the modeling process, one may begin to ask whether a more extensive process can continue to add value. Our results indicate that it yields no additional unique information that can benefit the model solution. It is our contention that unique views of information represent the real nuggets which add value to a data mining solution. More extensive mathematical manipulation and variable combinations did not provide significant incremental lift in our models, which suggests that they do not provide another unique view of the information.

**What does this mean going forward?**

Considering that variable creation can be the most laborious part of any data mining exercise, this kind of finding can provide some direction on how analysts should best focus their time in any data mining project. Given the time pressures that analysts and practitioners face from business users, concentrating on the traditional techniques allows analysts to be more effective in their variable creation efforts and to build solutions that are both timely and optimal.