Stratified Sampling vs. Posterior Probability Thresholds

Dean Abbott

One of the great things about conferences like the recent Predictive Analytics World is how many technical interactions one has with top practitioners; this past October was no exception. One such interaction was with Tim Manns, who blogs here. We were talking about Clementine and what to do with small populations of 1s in the target variable, which prompted me to jump onto my soapbox about an issue I had never read about, but which occurs commonly in data mining problems such as response modeling and fraud detection.

The setup goes something like this: you have 1% responders, you build models, and the model “says” every record is a 0. My explanation for this was always that errors in classification models occur when the same pattern of inputs can produce both outcomes. In that situation, what is the best guess? The most commonly occurring output value. If 99% of the records are 0s, a record with an ambiguous input pattern is most likely a 0, and therefore data mining tools will produce the answer “0”. The common solution is to resample (stratify) the data so that there are equal numbers of 0s and 1s, and then rebuild the model. While this works, it misses an important factor.
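For concreteness, here is a minimal sketch of that resampling step, assuming a pandas DataFrame with a 0/1 target column; the function and column names are illustrative, not from the original discussion:

```python
import pandas as pd

def stratify_balance(df: pd.DataFrame, target: str = "y", seed: int = 0) -> pd.DataFrame:
    """Downsample the 0s so the two classes are equally represented."""
    ones = df[df[target] == 1]
    zeros = df[df[target] == 0]
    # keep every rare 1, draw an equal-sized random sample of 0s
    balanced = pd.concat([ones, zeros.sample(n=len(ones), random_state=seed)])
    return balanced.sample(frac=1, random_state=seed)  # shuffle the rows
```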


I can’t claim credit for this (thanks Marie!). I was working on a consulting project with a statistician, and when we were building logistic regression models, I recommended resampling so we wouldn’t have the “model calls everything a 0” problem. She seemed puzzled and asked why we didn’t just threshold at the prior probability level. It was immediately clear that she was right, and I’ve been doing it ever since (with logistic regression and neural networks in particular).

What was she saying? First, it needs to be stated that no algorithm produces “decisions.” Logistic regression produces probabilities; neural networks produce confidence values (though I just had a conversation with one of the smartest machine learning guys I know who argued that neural networks produce true probabilities; maybe I’ll blog on this another time). The decisions one sees (“all records are called 0s”) are produced by the software, which interprets the probabilities or confidence values by thresholding them at 0.5. It is always 0.5; I don’t think I’ve ever found a data mining software package that doesn’t threshold at 0.5. So the software implicitly expects the prior probabilities of 0s and 1s to be equal. When they are not (as with 99% 0s and 1% 1s), this threshold is completely inappropriate: the distribution of predicted probabilities will center roughly on the prior probability (0.01 for the 1% response rate problem). I show some examples of this in my data mining course that make this clearer.
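A hedged sketch of this point, using synthetic data and scikit-learn rather than any of the tools named in the post: a package’s “decision” is just the probability thresholded at 0.5, and with 99% 0s those probabilities pile up near the prior.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# roughly the 99%/1% situation described above, with heavy class overlap
X, y = make_classification(n_samples=20000, weights=[0.99],
                           class_sep=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
p = model.predict_proba(X)[:, 1]

print(np.median(p))             # probabilities cluster near the prior, ~0.01
print(model.predict(X).sum())   # hardly anything crosses 0.5: almost all 0s
# the "decision" is nothing more than the probability thresholded at 0.5
assert (model.predict(X) == (p > 0.5).astype(int)).all()
```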

So what can one do? If one thresholds at 0.01 rather than 0.5, one gets a sensible confusion matrix out of the classification problem. Of course, if you use an ROC curve, lift chart, or gains chart to assess your model, you don’t worry about thresholding anyway.
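Continuing the same synthetic setup (again a sketch, not any particular tool’s output), thresholding at the training proportion instead of 0.5 makes the difference visible:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=20000, weights=[0.99],
                           class_sep=0.5, random_state=0)
p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

prior = y.mean()  # proportion of 1s in the training data, ~0.01
print(confusion_matrix(y, (p > 0.5).astype(int)))    # degenerate: nearly every record called a 0
print(confusion_matrix(y, (p > prior).astype(int)))  # both classes actually appear in the predictions
```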

Which brings me back to the conversation with Tim Manns. I’m glad he tried it out himself, though I don’t think one has to make the target variable continuous to make this work. Tim did his testing in Clementine, but the same holds for any other data mining software tool. Tim’s trick is sound: if you make the [0,1] target variable numeric, you can build a neural network just fine, and the predicted value is “exposed.” In Clementine, if you keep it as a “flag” variable, you would instead threshold the propensity value ($NRP-target).
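Outside Clementine the same idea carries over: recode the flag as a number and the model’s raw score is exposed directly. A hypothetical analogue in scikit-learn (not Tim’s actual stream; the network size and iteration count are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPRegressor

X, y = make_classification(n_samples=5000, weights=[0.99],
                           class_sep=0.5, random_state=0)

# treat the 0/1 flag as a numeric target so the network's raw output is exposed
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=1000,
                   random_state=0).fit(X, y.astype(float))
scores = net.predict(X)                  # continuous, propensity-like values
flags = (scores > y.mean()).astype(int)  # threshold at the prior, not 0.5
```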

So, read Tim’s post (and his other posts!). This trick can be used with nearly any tool; I’ve done it with Matlab and Tibco Spotfire Miner, among others.

Now, if only tools would include an option to threshold the propensity at either 0.5 or the prior probability (or, more precisely, the proportion of 1s in the training data).
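Until they do, the option is easy to emulate. A hypothetical wrapper (the class name is mine) that learns the training proportion and uses it as the threshold:

```python
import numpy as np

class PriorThresholdClassifier:
    """Wrap any predict_proba-style model; threshold at the training prior, not 0.5."""

    def __init__(self, base):
        self.base = base

    def fit(self, X, y):
        self.base.fit(X, y)
        self.prior_ = np.mean(y)  # proportion of 1s in the training data
        return self

    def predict(self, X):
        return (self.base.predict_proba(X)[:, 1] > self.prior_).astype(int)
```

Used as, say, PriorThresholdClassifier(LogisticRegression()).fit(X, y).predict(X_new), it reproduces the thresholding above without having to remember to do it by hand.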
