SmartData Collective
© 2008-25 SmartData Collective. All Rights Reserved.

Why Predictive Modelers Should be Suspicious of Statistical Tests

DeanAbbott

Well, the danger is really not the statistical test per se; it is the interpretation of the statistical test.

Yesterday I tweeted (@deanabb) this fun factoid: “Redskins predict Romney wins POTUS #overfit. if Redskins lose home game before election => challenger wins (17/18) http://www.usatoday.com/story/gameon/2012/11/04/nfl-redskins-rule-romney/1681023/” I frankly had never heard of this “rule” before and found it quite striking. It even has its own Wikipedia page (http://en.wikipedia.org/wiki/Redskins_Rule).

For those of us in the predictive analytics or data mining community, and those of us who use statistical tests to help interpret small data, we know that 17/18 is a hugely significant finding. This can frequently be good: statistical tests help us gain intuition about the value of relationships in data even when they aren’t obvious.

In this case, an appropriate test is a chi-square test on two binary variables: (1) did the Redskins win their last home game before the general election (call it the input or predictor variable), and (2) did the incumbent political party win the general election for President of the United States (POTUS)?

According to the Redskins Rule, the two line up in 17 of 18 cases since 1940. Could this be by chance? If we apply the chi-square test to it, it sure does look significant (chi-square = 14.4, p < 0.001)! I like the decision tree representation of this, which shows how significant it is (built using the Interactive CHAID tree in IBM Modeler on Redskins Rule data I put together here).

It’s great data–9 Redskin wins, 9 Redskin losses, great chi-square statistic!
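
The chi-square figure can be checked by hand. The sketch below is not the IBM Modeler output. It assumes a 2x2 table with 9 Redskins wins, 9 losses, and a single miss placed in the "lost" row (the statistic comes out the same if the miss sits in the other row), and it uses the plain Pearson statistic with no continuity correction. The function name is made up for illustration.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Assumed layout (not taken from the original data file):
# rows    = Redskins won / Redskins lost
# columns = incumbent party won / incumbent party lost
table = [[9, 0],
         [1, 8]]

stat, p_value = chi_square_2x2(table)
print(f"chi-square = {stat:.1f}, p = {p_value:.5f}")
```

Under these assumptions the statistic works out to 14.4, matching the value quoted above, and the p-value falls below 0.001.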

OK, so it’s obvious that this is just another spurious correlation in the spirit of all of those fun examples in history, such as the Super Bowl winning conference predicting whether the stock market would go up or down in the next year, at a stunning 20 of 22 correct. It was even the subject of academic papers!

The broader question (and concern) for predictive modelers is this: how do we recognize when the correlations we have uncovered in the data are merely spurious? This can happen especially when we don’t have deep domain knowledge and therefore wouldn’t necessarily identify variables or interactions as spurious. In examples such as the election or stock market predictions, no amount of “hold out” samples, cross-validation or bootstrap sampling would uncover the problem: it is in the data itself.

We need to think about this because inductive learning techniques search through hundreds, thousands, even millions of variables and combinations of variables. The phenomenon of “over searching” is a real danger with inductive algorithms as they search and search for patterns in the input space. Jensen and Cohen have a very nice and readable paper on this topic (PDF here). For trees, they recommend using the Bonferroni adjustment, which helps penalize the combinatorics associated with splits. But our problem here goes far deeper than overfitting due to combinatorics.
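
To make the Bonferroni idea concrete, here is a minimal sketch of the generic adjustment (divide the significance level by the number of tests). It is not the specific procedure from the Jensen and Cohen paper. The function names, the 0.05 significance level and the candidate counts are assumptions chosen for illustration; the p-value is derived from the chi-square of 14.4 with one degree of freedom.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni adjustment."""
    return alpha / n_tests

def survives_bonferroni(p_value, alpha, n_tests):
    """True if a single p-value is still significant after the adjustment."""
    return p_value < bonferroni_threshold(alpha, n_tests)

# p-value for chi-square = 14.4 with 1 degree of freedom.
p_value = math.erfc(math.sqrt(14.4 / 2))

# Hypothetical numbers of candidate variables searched.
for n_tests in (1, 100, 1000, 10000):
    verdict = "significant" if survives_bonferroni(p_value, 0.05, n_tests) else "not significant"
    print(f"{n_tests:>6} candidates: threshold = {bonferroni_threshold(0.05, n_tests):.6f} -> {verdict}")
```

A finding that looks overwhelming as a single test stops being significant once it is treated as the best of a thousand candidates.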

Of course the root problem with all of these spurious correlations is small data. Even if we have lots of data, what I’ll call here the “illusion of big data”, some algorithms, like decision trees, rule induction and nearest neighbor (i.e., algorithms that build bottom-up), make decisions based on smaller populations. Anytime decisions are made from populations of 15, 20, 30 or even 50 examples, there is a danger that our search through hundreds of variables will turn up a spurious relationship.
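
A small simulation illustrates the danger. This is a sketch under stated assumptions: the target and every candidate variable are independent fair coin flips, so any agreement is pure chance. The function name, the seed and the candidate counts are hypothetical.

```python
import random

def best_chance_agreement(n_cases, n_variables, seed=0):
    """Best agreement between a random binary target and purely random binary candidates."""
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(n_cases)]
    best = 0
    for _ in range(n_variables):
        candidate = [rng.randint(0, 1) for _ in range(n_cases)]
        agree = sum(c == t for c, t in zip(candidate, target))
        # A variable that almost always disagrees is just as "predictive" once flipped.
        best = max(best, agree, n_cases - agree)
    return best

for n_variables in (10, 100, 1000, 10000):
    best = best_chance_agreement(n_cases=18, n_variables=n_variables)
    print(f"{n_variables:>6} random variables: best match {best}/18")
```

The more candidates we search, the better the best purely random match over 18 cases tends to look.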

What do we do about this? First, make sure you have enough data so that these small-data effects don’t bite you. This is why I strongly recommend doing data audits and looking for categorical variables that contain levels with at most dozens of examples; these are potential overfitting categories.
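
One simple way to run this part of a data audit is to count the examples behind each level of a categorical variable and flag the sparse ones. In the sketch below the function name, the cutoff of 50 and the sample column are all made up for illustration; pick a cutoff that suits your data.

```python
from collections import Counter

def sparse_levels(values, min_count=50):
    """Return the levels of a categorical variable backed by fewer than min_count examples."""
    counts = Counter(values)
    return {level: n for level, n in counts.items() if n < min_count}

# Hypothetical categorical column: two well-populated levels and two sparse ones.
states = ["CA"] * 400 + ["TX"] * 250 + ["WY"] * 12 + ["VT"] * 7

print(sparse_levels(states))  # {'WY': 12, 'VT': 7}
```

Levels flagged this way are candidates for grouping, dropping, or at least extra scrutiny before a model is allowed to split on them.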

Second, don’t hold strongly to any patterns discovered in your data based solely on the data, especially if they are based on relatively small sample sizes. These must be validated with domain experts. Decision trees are notorious for allowing splits deep in the trees that are “statistically significant” but dangerous nevertheless because of small data sizes.

Third, the gist of your models has to make sense. If it doesn’t, put on your “Freakonomics” hat and dig in to understand why the patterns were detected by the models. In our Redskins Rule, clearly this doesn’t make sense causally, but sometimes the pattern picked up by the algorithm is just a surrogate for a real relationship.

Nevertheless, I’m still curious to see if the Redskins Rule will prove to be correct once again. This year it predicts a Romney win because the Redskins lost and therefore the incumbent party (D) should, by the rule, lose.

UPDATE: by way of comparison, the chance of having 17 or 18 of 18 coin flips turn up heads (or tails; we’re assuming a fair coin after all!) is about 7 in 100,000, or roughly 1 in 14,000. Put another way, if we examined 14K candidate variables unrelated to POTUS trends, we would expect about one of them to line up in 17 or 18 of the 18 elections. Unusual? Yes. Impossible? No!
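
The coin-flip arithmetic can be reproduced in a few lines. This sketch assumes a fair coin and counts matches in one direction only (17 or 18 heads); counting a run of tails as well would double the probability. The function name is hypothetical, and the 14,000 candidates are treated as independent.

```python
from math import comb

def tail_probability(n_flips, k_min):
    """P(at least k_min heads in n_flips tosses of a fair coin)."""
    favorable = sum(comb(n_flips, k) for k in range(k_min, n_flips + 1))
    return favorable / 2 ** n_flips

p = tail_probability(18, 17)  # (18 + 1) / 2**18 = 19 / 262144
print(f"P(17 or 18 heads of 18) = {p:.6f}, about 1 in {1 / p:,.0f}")

# Searching 14,000 unrelated candidates, assumed independent.
n_candidates = 14_000
expected_matches = n_candidates * p
p_at_least_one = 1 - (1 - p) ** n_candidates
print(f"expected chance matches: {expected_matches:.2f}")
print(f"P(at least one chance match): {p_at_least_one:.2f}")
```

The expected number of chance matches among 14,000 unrelated candidates is close to one, and under the independence assumption the probability of seeing at least one is well over half.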
