
Breakthrough: How to Avert Analytics’ Most Treacherous Pitfall

Eric Siegel

This article will make you feel better. And you do need to feel better, if you are one of the many of us who practice analytics—or who must consume and rely on analytics—and find ourselves carrying tension in our shoulders or sometimes losing sleep.

The fear stems from a well-known warning of tragic mishap: “If you torture the data long enough, it will confess,” as stated by University of Chicago economics professor Ronald Coase. There is a general sense that the math can go wrong, and that analytics is more art than exact science.

As John Elder of Elder Research put it, “It’s always possible to get lucky (or unlucky). When you mine data and find something, is it real, or chance?” How can we confidently trust what a computer claims to have learned? How do we avert the dire declension, “Lies, damned lies, and statistics”?


There is a simple, elegant solution from Elder—but first, let me further magnify your fear: Even the very simplest predictive model risks utter failure. Mistaken, misleading conclusions are in fact terribly easy to come by.

A conclusion drawn about one single variable—even without the use of a common multivariate model (such as log-linear regression)—can go awry. In fact, one of the more famous such analytical insights, “an orange used car is least likely to be a lemon,” has recently been debunked by Elder and his colleague Ben Bullard at Elder Research, Inc.

Big data, with all its pomp and circumstance, can actually mean big risk. More data presents more opportunities to inadvertently discover untrue patterns that appear misleadingly strong within your dataset—but, in fact, do not hold true in general. To be more specific, “bigger” data could mean longer data (a longer list of examples, which generally helps avert spurious conclusions), but it could also mean wider data (more columns—more variables/factors per example). With wider data, even if you are only considering one variable at a time, such as the color of each car, you are more likely to come across one that just happens to look predictive in your data by sheer chance alone. John Elder has dubbed this peril, which arises when searching across many variables, vast search.
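To see vast search in action, consider a minimal simulation (a sketch with hypothetical numbers, not drawn from the article): generate a thousand purely random candidate variables, keep whichever correlates best with a random target, and note how “significant” the winner looks.

```python
# Minimal "vast search" simulation: every variable is pure noise, yet the
# best of many candidates looks highly significant. (Hypothetical numbers.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_rows, n_cols = 50, 1000            # "wider" data: 1,000 candidate variables
target = rng.normal(size=n_rows)     # a target containing no real signal
candidates = rng.normal(size=(n_rows, n_cols))

# Test each column against the target and keep the best-looking one.
pvals = [stats.pearsonr(candidates[:, j], target)[1] for j in range(n_cols)]
best = int(np.argmin(pvals))
print(f"best column: {best}, nominal p-value: {pvals[best]:.5f}")
```

With 1,000 tries, the winning p-value typically lands near 0.001: nominally strong evidence, even though every column is random noise.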

Dr. Elder puts it this way: “Modern predictive analytic algorithms are hypothesis-generating machines, capable of testing millions of ‘ideas.’ The best result stumbled upon in its vast search has a much greater chance of being spurious… The problem is so widespread that it is the chief reason for a crisis in experimental science, where most journal results have been discovered to resist replication; that is, to be wrong!”

A few years ago, Berkeley Professor David Leinweber made waves with his discovery that the annual closing price of the S&P 500 stock market index could have been predicted from 1983 to 1993 by the rate of butter production in Bangladesh. Bangladesh’s butter production mathematically explains 75 percent of the index’s variation over that time. Urgent calls were placed to the Credibility Police, since it certainly cannot be believed that Bangladesh’s butter is closely tied to the U.S. stock market. If its butter production boomed or went bust in any given year, how could it be reasonable to assume that U.S. stocks would follow suit? This stirred up the greatest fears of predictive analytics skeptics and vindicated nonbelievers. Eyebrows were raised so vigorously, they catapulted Professor Leinweber onto national television.

Crackpot or legitimate educator? It turns out Leinweber had contrived this analysis as a playful publicity stunt, within a chapter entitled “Stupid Data Miner Tricks” in his book Nerds on Wall Street. His analysis was designed to highlight a common misstep by exaggerating it. It’s dangerously easy to find ridiculous correlations, especially when you’re “predicting” only 11 data points (annual index closings for 1983 to 1993). Search through a large number of financial indicators across many countries and something or other will show similar trends, just by chance; such a search will eventually unearth cockamamie relationships. For example, shiver me timbers, a related study showed buried treasure discoveries in England and Wales predicted the Dow Jones Industrial Average a full year ahead from 1992 to 2002.

Leinweber attracted the attention he sought, but his lesson didn’t seem to sink in. “I got calls for years asking me what the current butter business in Bangladesh was looking like and I kept saying, ‘Ya know, it was a joke, it was a joke!’ It’s scary how few people actually get that.” As Black Swan author Nassim Taleb put it in his suitably titled book, Fooled by Randomness, “Nowhere is the problem of induction more relevant than in the world of trading—and nowhere has it been as ignored!” Thus the occasional overzealous yet earnest public claim of economic prediction based on factors like women’s hemlines, men’s necktie width, Super Bowl results, and Christmas day snowfall in Boston.

The culprit that kills machine learning is overlearning (aka overfitting). Overlearning is the pitfall of mistaking noise for information, of assuming too much about what the data has shown. You’ve overlearned if you’ve read too much into the numbers and been led astray from the underlying truth.
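Here is an equally minimal sketch of overlearning itself (again, hypothetical numbers): a model flexible enough to fit its training data exactly “discovers” a perfect pattern in pure noise, then collapses on fresh data.

```python
# Minimal overlearning sketch: a flexible model fits random noise perfectly
# in-sample, then fails on fresh noise. (All numbers are hypothetical.)
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 10)
y_train = rng.normal(size=10)   # pure noise: there is no pattern to learn
y_test = rng.normal(size=10)    # fresh noise drawn the same way

coeffs = np.polyfit(x, y_train, deg=9)  # 10 coefficients for 10 points
train_mse = np.mean((np.polyval(coeffs, x) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x) - y_test) ** 2)
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2f}")
```

Training error is essentially zero (the “discovered pattern”), while test error is large: the model memorized noise rather than learning anything real.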

While many analytics practitioners consider overlearning a risk only for predictive models that combine multiple variables, the truth is that even well-publicized single-variable results are at risk. A dire need for a new paradigm has emerged.

But is it really so hard to guard against? Why do analysts now assert that standard tests of statistical significance break down when vast search is in play?

And what can be done to validate (i.e., test for significance) even after vast search has claimed to have made a discovery?

Now that your interest has been piqued, you may get the answers from one or both of the following in-depth sources:

  1. PLENARY CONFERENCE SESSION. Presentation at six Predictive Analytics World events in 2014: PAW San Francisco (March), PAW Toronto (May), PAW Chicago (June), PAW Government (September in DC), PAW Boston (October), and PAW London (October): “The Peril of Vast Search (and How Target Shuffling Can Save Science)” by John Elder, CEO & Founder, Elder Research, Inc. Full session description
  2. TECHNICAL PAPER: “Are Orange Cars Really not Lemons?” by Ben Bullard & John Elder, Elder Research, Inc. This technical paper explores the difficulty introduced above, walking the reader through a detailed example and introducing a solution for addressing the challenge at hand: target shuffling. Partial excerpt of the paper:

A recent article in The Seattle Times reported that “an orange used car is least likely to be a lemon.” This discovery surfaced in a competition hosted by Kaggle to predict bad buys among used cars using a labeled dataset. Of the 72,983 used cars, 8,976 were bad buys (12.3%). Yet, of the 415 orange cars in the dataset, only 34 were bad (8.2%)…

But how unusual is this low proportion? That is, assuming the true proportion is really equal, what is the likelihood that it could have occurred by chance for a random partition of that size? Such a calculation takes into account the numbers of cars making up both proportions (good and bad Orange vs. good and bad non-Orange). When we apply a 1-sided statistical hypothesis test for equality of proportions between two samples, it yields a p-value of 0.00675. In other words, the hypothesis test reveals that if the underlying reality is that the proportion of bad buys among orange cars is really equal to the proportion of bad buys among all non-orange cars, then the probability that one would observe a sample proportion for orange cars that is so much lower than the sample proportion for non-orange cars (given sample sizes of 415 and 72,568, respectively) is only 0.675%.

[…]

[But] what we see is that statistical hypothesis tests only work when the hypothesis comes first, and the analysis second. One cannot use the data to inform the hypothesis and then test that hypothesis on the same data. That leads to overfit and over-confidence in your results, which leads to the model underperforming (or failing entirely) on new data, where it is most needed.

And yet, how do we know what to hypothesize? Isn’t the great strength of data mining that the computer can try out all sorts of things and report back which ones might work?
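To make the excerpt’s arithmetic concrete, here is a sketch of that one-sided proportion test using the counts quoted above. Fisher’s exact test is used as one reasonable variant; the exact p-value shifts slightly with the test chosen (pooled z-test, continuity correction, or exact test), and the paper reports 0.675 percent.

```python
# One-sided test: is the orange bad-buy rate (34/415) lower than the
# non-orange rate? Counts come from the paper excerpt above; Fisher's exact
# test is one variant, so the result differs slightly from the paper's 0.00675.
from scipy.stats import fisher_exact

orange_bad, orange_good = 34, 415 - 34
other_bad = 8976 - 34
other_good = (72983 - 415) - other_bad

table = [[orange_bad, orange_good],
         [other_bad, other_good]]
_, p_value = fisher_exact(table, alternative="less")  # odds ratio < 1
print(f"one-sided p-value: {p_value:.5f}")  # on the order of 0.005-0.007
```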

Access the full technical paper by Bullard and Elder
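The paper’s remedy, target shuffling, is straightforward to sketch. Shuffle the bad-buy labels so that no color can genuinely matter, then ask how often the best of many candidate colors looks as good as orange did. In the sketch below, the split into 20 colors of 415 cars each is a hypothetical assumption; only the overall counts come from the excerpt.

```python
# Target-shuffling sketch: with labels shuffled, any "predictive" color is
# pure chance. The 20-colors-of-415-cars split is hypothetical; the totals
# (72,983 cars, 8,976 bad buys) and orange's 34/415 rate are from the excerpt.
import numpy as np

rng = np.random.default_rng(2)
n_cars, n_bad = 72983, 8976
n_colors, color_size = 20, 415
labels = np.array([True] * n_bad + [False] * (n_cars - n_bad))
observed_rate = 34 / 415                 # orange's 8.2% bad-buy rate

hits, n_shuffles = 0, 500
for _ in range(n_shuffles):
    shuffled = rng.permutation(labels)   # break any real label-color link
    # Carve 20 disjoint "colors" of 415 cars out of the shuffled fleet and
    # find the best-looking (lowest) bad-buy rate among them.
    groups = shuffled[: n_colors * color_size].reshape(n_colors, color_size)
    if groups.mean(axis=1).min() <= observed_rate:
        hits += 1

print(f"{hits}/{n_shuffles} shuffles produced a 'color' as good as orange")
```

Under these assumptions, roughly 10 to 15 percent of shuffles typically produce some color that looks as good as orange did: a finding that is nominally significant at 0.675 percent becomes far less impressive once the search across many colors is taken into account.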

 
