Business Intelligence

Myth or Fact: The Diminishing Marginal Returns of Variable Creation in Data Mining Solutions

Richard Boire

Data mining practitioners will tell you that much of the real value of their work lies in the ability to derive and create new variables from the source information within a database or file. For example, averages or totals calculated over a specific timeframe or period represent information that is derived within the database and is unlikely to be extracted directly from the source information. Another good example is a postal area variable based on the first character of the postal code. Although the examples cited above seem pretty simple and basic, it is not uncommon for derived variables to comprise over 90% of the information within an analytical exercise. From the data analyst's perspective, there is no limit to the number of derived variables that can be created; the limitation is confined only to the imagination of the analyst or practitioner. Obviously, creating variables and adding new information should, in theory, provide incremental benefit to any data mining solution. But as with any exercise or project, there are diminishing returns as one begins to explore the many possibilities and permutations that exist within variable creation. In this article, we attempt to address this issue not in an academic manner but in the practical sense of whether it provides business benefit to a given data mining solution. Specifically, we explore the impact of exploding the number of variables versus our traditional techniques of variable creation.
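
To make the idea concrete, here is a minimal pandas sketch of deriving timeframe-based totals and averages from raw transactions. The column names, the figures, and the six-month window are hypothetical, invented purely for illustration:

import pandas as pd

# Hypothetical transaction-level source data: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2023-01-15", "2023-05-02", "2023-02-10", "2023-03-08", "2023-06-21"]),
    "amount": [120.0, 80.0, 45.0, 60.0, 95.0],
})

# Derived variables: total and average spend over the last six months,
# information that exists nowhere in the source file as a ready-made column.
cutoff = transactions["purchase_date"].max() - pd.DateOffset(months=6)
recent = transactions[transactions["purchase_date"] >= cutoff]
derived = recent.groupby("customer_id")["amount"].agg(
    total_spend_6m="sum", avg_spend_6m="mean")
print(derived)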

The Traditional Techniques of Variable Creation

Our traditional techniques of variable creation attempt to yield insights by creating variables in the following fashion:

  • Binary Variables (a yes/no outcome that represents the occurrence of some activity or event)
  • Average/Median or Sum of a Variable
  • Index/Ordinal Variables, whereby variable outcomes and values are placed into ranked groups. For example, age might be grouped into three outcomes, with 1 being under 30 years, 2 being 31-50, and 3 being 51+.
  • Change/Velocity Variables, whereby variables are created that look at how behaviour has changed over time.

Across these four areas, it is not unusual to discover that the analyst has created hundreds of variables for a given analytical exercise. But are there other transformations that we should consider when building solutions? The rationale for looking at a whole new suite of additional variables is that this potential new information may provide incremental benefit to a given solution. In our research here, we looked at predictive models that were built using our traditional approach of variable creation, as described above.
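
As a concrete illustration of the four variable types above, here is a minimal pandas sketch; the customer file, its columns, and the spend figures are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical customer-level file with current and prior-period spend.
df = pd.DataFrame({
    "age": [24, 37, 52, 61],
    "spend_recent_6m": [200.0, 0.0, 150.0, 90.0],
    "spend_prior_6m": [180.0, 50.0, 150.0, 120.0],
})

# 1. Binary variable: did any purchase activity occur in the recent period?
df["purchased_recent"] = (df["spend_recent_6m"] > 0).astype(int)

# 2. Average/sum variable: mean spend across the two periods.
df["avg_spend"] = df[["spend_recent_6m", "spend_prior_6m"]].mean(axis=1)

# 3. Index/ordinal variable: age banded into ranked groups
#    (1: 30 and under, 2: 31-50, 3: 51+), as in the example above.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, np.inf], labels=[1, 2, 3])

# 4. Change/velocity variable: relative change in spend between the periods.
df["spend_change"] = (df["spend_recent_6m"] - df["spend_prior_6m"]) / df["spend_prior_6m"]
print(df)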

Exploding the Number of Variables

We then looked at additional approaches that would explode the number of variables in our analytical file. Further mathematical transformations were employed, consisting of the following (a short sketch follows the list):

  • Log transformation
  • Square root transformation
  • Sine transformation
  • Cosine transformation
  • Tangent transformation
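
A minimal sketch of this fivefold expansion on a hypothetical two-variable base file (np.log1p stands in for a raw log so that zero-valued variables do not produce log(0)):

import numpy as np
import pandas as pd

# Hypothetical base variables from the traditional approach.
base = pd.DataFrame({"avg_spend": [10.0, 55.0, 120.0],
                     "tenure_months": [6.0, 24.0, 60.0]})

# Apply each of the five transformations to every base variable:
# 20 base variables would yield 5 x 20 = 100 transformed variables.
transforms = {"log": np.log1p, "sqrt": np.sqrt,
              "sin": np.sin, "cos": np.cos, "tan": np.tan}
expanded = base.copy()
for name, fn in transforms.items():
    for col in base.columns:
        expanded[f"{col}_{name}"] = fn(base[col])
print(expanded.shape)  # (3, 12): 2 base + 5 x 2 transformed columns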

We also looked at combining pairs of top variables that were significant within the correlation analysis of the given predictive model. A good example would be age and gender, where we can capture the impact of age and gender together and observe their joint impact on the modeled behaviour. In the variable pair routines, we attempted to look at all possible combinations. If there are 20 variables derived using the traditional approach, then the potential number of variable pairs is (20 × 19)/2 = 190.
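
A sketch of the pairing routine using itertools.combinations; the three variables and their multiplicative interaction are hypothetical stand-ins for the top variables flagged in the correlation analysis:

from itertools import combinations

import pandas as pd

# Hypothetical top variables from the correlation analysis.
df = pd.DataFrame({"age": [25, 40, 58],
                   "gender_f": [1, 0, 1],  # assumed binary-coded gender
                   "avg_spend": [30.0, 80.0, 55.0]})

# All unordered pairs, captured here as multiplicative interactions.
for a, b in combinations(df.columns, 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]

n = 20
print(n * (n - 1) // 2)              # 190 variable pairs
print(n + 5 * n + n * (n - 1) // 2)  # 20 base + 100 transformed + 190 pairs = 310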

Both these transformations (mathematical and pairs) dramatically increase the number of variables. The mathematical transformations take the count to 100 (5 × 20), and the variable pairs add another 190 (as seen above). In this simple exercise, the 20 variables from the traditional approach explode to 310 (20 + 100 + 190).

Business Cases to Demonstrate the Point

The challenge, though, in this type of exercise is to determine whether there is a real business benefit in exploding the number of variables from 20 to 310. Our approach to identifying this benefit was to compare models developed before this variable creation explosion with models developed after it. We looked at four models built by our organization:

  • Acquisition Model
  • Up-Sell Model
  • Cross-Sell Model
  • Insurance Loss Model

Lift/gains charts were created using just the 20 variables to build the above models, and again using the expanded set of 310 variables. In all four cases, we found that the more exhaustive, more expansive approach to variable creation produced no real incremental lift in any of the models.
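
The article does not specify the modeling stack, so purely to illustrate the mechanics of such a comparison, here is a scikit-learn sketch on synthetic data: it fits the same simple classifier on a 20-variable set and a 310-variable set and compares top-decile lift. Because the data is random, the output demonstrates the evaluation procedure only, not our findings:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 310 candidate variables, response driven by two of them.
rng = np.random.default_rng(0)
X_big = rng.normal(size=(5000, 310))
y = (X_big[:, 0] + 0.5 * X_big[:, 1] + rng.normal(size=5000) > 0).astype(int)
X_small = X_big[:, :20]  # the "traditional" 20-variable file

def top_decile_lift(X, y):
    """Response rate in the top-scored 10% relative to the overall rate."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    top = np.argsort(scores)[::-1][: len(scores) // 10]
    return y_te[top].mean() / y_te.mean()

print(top_decile_lift(X_small, y))
print(top_decile_lift(X_big, y))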

But it is not unreasonable to ask for a deeper understanding, or some rationale, as to why these results occur. Let's try to address this by looking at variable creation in each of its stages.

Considering Demographic Variables

The first stage actually represents the source information, or variables that are not manipulated by the analyst. Good examples of this are tenure, income, household size, etc. This information is extremely useful as it captures the basic demographics of the individual.

The subsequent stages, as outlined below, all comprise derived variables, which in most cases come from a company's purchase/billing systems. For now, we are going to focus on stages that deal with the following:

  • Grouping of variables
  • Calculating basic arithmetic diagnostics (mean, median, standard deviation, min, and max)
  • Calculating change variables

Grouping of Variable Values

This stage looks at variables that represent the grouping of values into categories. A good example of this is postal code, where postal codes are grouped into regions (i.e. all postal codes beginning with the first character 'M' represent Toronto postal codes). Other examples might represent the grouping of specific product types or service codes into much broader product or service categories.
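
A minimal sketch of this grouping step; the region mapping and the service codes are hypothetical examples:

import pandas as pd

# Hypothetical mapping from the first character of a Canadian postal code
# to a broad region (e.g. 'M' for Toronto).
REGION_BY_FIRST_CHAR = {"M": "Toronto", "K": "Eastern Ontario",
                        "V": "British Columbia"}

df = pd.DataFrame({"postal_code": ["M5V 2T6", "K1A 0B1", "V6B 4Y8"],
                   "service_code": ["INT-100", "TV-200", "INT-250"]})

# Group postal codes into regions via the first character.
df["region"] = df["postal_code"].str[0].map(REGION_BY_FIRST_CHAR)

# Group detailed service codes into broader categories via their prefix.
df["service_category"] = df["service_code"].str.split("-").str[0]
print(df)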

Calculating Basic Arithmetic Diagnostics (Total, Mean, Median, Standard Deviation, Min, and Max)

This stage deals with the ability to summarize numerical information into a meaningful metric. However, it is only meaningful if we have historical information that can be used to calculate these kinds of diagnostics. For example, if we are calculating the average spend or the variation in spend, the question we must address is: what timeframe are we looking at? Is the average or variation based on six months, 12 months, etc.?
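
A sketch of these diagnostics computed over two candidate timeframes, using a hypothetical monthly spend history:

import pandas as pd

# Hypothetical monthly spend history: month 1 is oldest, month 12 most recent.
history = pd.DataFrame({
    "customer_id": [1] * 12 + [2] * 12,
    "month": list(range(1, 13)) * 2,
    "spend": [50, 55, 60, 40, 45, 70, 65, 50, 55, 60, 45, 50,
              10, 0, 5, 20, 15, 0, 30, 25, 10, 5, 0, 15],
})

# The timeframe is an explicit modeling choice: compute the same diagnostics
# over a 6-month and a 12-month window and compare.
for months in (6, 12):
    window = history[history["month"] > 12 - months]
    diag = window.groupby("customer_id")["spend"].agg(
        ["sum", "mean", "median", "std", "min", "max"]).add_suffix(f"_{months}m")
    print(diag)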

Calculating Change Variables

Extending this logic of using historical information, we may want to identify how summarized behaviour changes over time. Has spending or product purchase behaviour changed over a period of time? Has it changed drastically over the last three months, over the last six months, etc.?
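
A sketch of a change (velocity) variable that compares a recent window against the prior window of equal length, again on a hypothetical monthly history:

import pandas as pd

# Hypothetical monthly spend history: month 1 is oldest, month 12 most recent.
history = pd.DataFrame({
    "customer_id": [1] * 12 + [2] * 12,
    "month": list(range(1, 13)) * 2,
    "spend": [50, 55, 60, 40, 45, 70, 65, 50, 55, 60, 45, 50,
              10, 0, 5, 20, 15, 0, 30, 25, 10, 5, 0, 15],
})

def spend_change(history, months):
    """Relative change in average spend: recent window vs the window before it."""
    latest = history["month"].max()
    recent = history[history["month"] > latest - months]
    prior = history[(history["month"] <= latest - months)
                    & (history["month"] > latest - 2 * months)]
    r = recent.groupby("customer_id")["spend"].mean()
    p = prior.groupby("customer_id")["spend"].mean()
    return ((r - p) / p).rename(f"spend_change_{months}m")

print(spend_change(history, 3))  # last 3 months vs the 3 months before
print(spend_change(history, 6))  # last 6 months vs the 6 months before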

Putting Some Perspective on the Traditional Approach to Variable Creation

In all these stages, key information is being produced that is unique in explaining the desired modeled behaviour. Let's probe this thinking in more detail. The source-level information in many cases yields demographics such as age or the tenure of a customer, which, as we all know, have represented key variables within a given model at one time or another. Grouping values such as postal codes into regions can yield more meaningful insight when attempting to look at geography in a broader manner. Arithmetic diagnostics look at how summarized behaviour regarding key metrics can add value to a desired modeled behaviour; we have all seen summarized metrics such as total products purchased or average spend serve as key variables within models. The diagnostic-type variables differ from source variables in that source variables look at information at a point in time, while diagnostics look at information over a period of time. Meanwhile, the change-type variables add another dimension in that we are looking at how this summarized behaviour changes over time. Each of these stages (demographics, point-in-time variables from the source data, summarized data from calculating basic diagnostics, and change variables from the summarized data) represents a unique way of looking at the information. Because of these unique perspectives, models will typically incorporate variables from each of them.

Extensive Variable Creation

Having provided some rationale that the creation of variables using the traditional approach adds significant value to the modeling process, one may begin to ask whether a more extensive process can continue to add value. Our results would indicate that there is no real additional information which is unique and which can provide additional benefit to the model solution. It is our contention that unique views of the information represent the real nuggets which add value to a data mining solution. More extensive mathematical manipulation, as well as variable combinations, does not seem to provide significant incremental lift and therefore does not provide another unique view of the information.

What does this mean going forward?

Considering that variable creation can be the most laborious part of any data mining exercise, these kinds of findings can provide some direction in how the analyst should best focus his or her time in any data mining project. Given the time pressures that analysts and practitioners face from business users, analysts can be more effective in their variable creation efforts and build solutions that are both timely and optimal.
