© 2008-25 SmartData Collective. All Rights Reserved.
Why Defining the Target Variable in Predictive Analytics is Critical

DeanAbbott

Every data mining project begins with defining what problem will be solved. I won’t describe the CRISP-DM process here, but I use that general framework often when working with customers so they have an idea of the process.

Part of the problem definition is defining the target variable. I argue that this is the most critical step in the process that relates to the data, and more important than data preparation, missing value imputation, and the algorithm that is used to build models, as important as they all are.

The target variable carries with it all the information that summarizes the outcome we would like to predict, from the perspective of the algorithms we use to build the predictive models. Yet this can be misleading in many ways. I’m addressing one way we can be fooled by the target variable here, so please indulge me as I lead you down the path.

Let’s say we are building fraud models in our organization, and that the process for determining fraud runs as follows. First, possible fraud cases are identified (by tips or predictive models). Each case then goes to a manager who, if he or she believes there is value in investigating it, decides which investigator will get the case. If the investigator finds fraud, the case is tried in court, and ultimately the party is either convicted or found not guilty.

Our organization would like to prioritize which cases should be sent to investigators using predictive modeling. It is decided that we will use as a target variable all cases that were found to be fraudulent, that is, all cases that had been tried and a conviction achieved. Let’s assume here that all individuals involved are good at their jobs and do not make arbitrary or poor decisions (such decisions, of course, would be a problem of their own!).

Let’s also put aside for a moment the time lag involved here (a problem itself) and just consider the conviction as a target variable. What does the target variable actually convey to us? Of course our desire is that this target variable conveys fraud risk. Certainly when the conviction has occurred, we have high confidence that the case was indeed fraudulent, so the “1”s are strong and clear labels for fraud.

But, what about the “0”s? Which cases do they include?
–cases never investigated (i.e., we suspect they are not fraud, but don’t know)
–cases sent to a manager who never assigned them to an investigator (he/she didn’t think they were worth investigating)
–cases assigned to an investigator where the investigation has not yet been completed, was never completed, or was determined not to contain fraud
–cases that went to court but were found “not guilty”

Remember, all of these are given the identical label: “0”
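
To make this concrete, here is a minimal sketch in Python. The case records and stage names are hypothetical, invented purely for illustration and not taken from any real system. It labels each case by final conviction only, then counts which process stages end up hidden inside each label.

```python
from collections import Counter

# Hypothetical cases; the stage names are illustrative assumptions.
CASES = [
    {"id": 1, "stage": "never_investigated"},
    {"id": 2, "stage": "manager_declined"},
    {"id": 3, "stage": "investigation_open"},
    {"id": 4, "stage": "investigated_no_fraud"},
    {"id": 5, "stage": "acquitted"},
    {"id": 6, "stage": "convicted"},
    {"id": 7, "stage": "never_investigated"},
]


def conviction_label(case):
    """Target defined as final conviction only: 1 if convicted, else 0."""
    return 1 if case["stage"] == "convicted" else 0


def stages_by_label(cases):
    """Count how many cases from each stage fall under each label value."""
    counts = {0: Counter(), 1: Counter()}
    for case in cases:
        counts[conviction_label(case)][case["stage"]] += 1
    return counts


breakdown = stages_by_label(CASES)
for label, stages in breakdown.items():
    print(label, dict(stages))
```

In this toy data the “1” class contains a single, well-defined stage, while the “0” class mixes five different stages, only some of which say anything about whether fraud occurred.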

That means that any cases that look fraudulent on the surface, but were never investigated because resources were insufficient, are called “not fraudulent”. It means that cases where the investigator was pulled off to investigate other cases are called “not fraudulent”. It means, too, that court cases thrown out due to a technicality unrelated to the fraud itself are called “not fraudulent”.

In other words, the target variable defined as only the “final conviction” represents not only the risk of fraud for a case, but also the investigation and legal system. Perhaps complex cases that are high risk are thrown out because they aren’t (at this particular time, with these particular investigators) worth the time. Is this what we want to predict? I would argue “no”. We want our target variable to represent the risk, not the system.

This is why when I work on fraud detection problems, the definition of the target variable takes time: we have to find measures that represent risk and are informative and consistent, but don’t measure the system itself. For different customers this means different trade-offs, but usually it means using a measure from earlier in the process.
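
As one possible illustration of “a measure from earlier in the process”, the sketch below compares the conviction target with a target based on the investigator’s finding. The field names, the records, and the choice to treat uninvestigated cases as unknown (`None`) rather than “0” are all assumptions made for this example; they are not a prescription, and the right trade-off will differ by customer.

```python
# Hypothetical case records; field names and values are illustrative assumptions.
CASES = [
    {"id": 1, "investigation_done": False, "fraud_found": None, "convicted": False},
    {"id": 2, "investigation_done": True, "fraud_found": False, "convicted": False},
    # Fraud found by the investigator, but the court case was thrown out.
    {"id": 3, "investigation_done": True, "fraud_found": True, "convicted": False},
    {"id": 4, "investigation_done": True, "fraud_found": True, "convicted": True},
]


def conviction_target(case):
    """Late-stage target: 1 only when a conviction was achieved."""
    return 1 if case["convicted"] else 0


def investigator_target(case):
    """Earlier-stage target: the investigator's finding.

    Returns None when no completed investigation exists, so the case is
    treated as unknown instead of being silently labeled 0 (an assumption).
    """
    if not case["investigation_done"]:
        return None
    return 1 if case["fraud_found"] else 0


# (id, conviction target, investigator target) for every case.
labeled = [(c["id"], conviction_target(c), investigator_target(c)) for c in CASES]

# Keep only cases whose earlier-stage label is actually known.
training_rows = [row for row in labeled if row[2] is not None]

for row in labeled:
    print(row)
```

Case 3 is the interesting one: the conviction target calls it “0”, while the investigator-based target calls it “1”. Note that this earlier measure still reflects part of the system, since someone decided which cases to investigate, so it reduces the problem rather than removing it.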

So in summary, think carefully about the target variable you are defining, and don’t be surprised when your predictive models predict exactly what you told them to!

TAGGED:predictive modeling