Analytics

Breakthrough: How to Avert Analytics’ Most Treacherous Pitfall

Eric Siegel
11 Min Read

This article will make you feel better. And you do need to feel better, if you are one of the many of us who practice analytics—or who must consume and rely on analytics—and find ourselves carrying tension in our shoulders or sometimes losing sleep.

The fear stems from a well-known warning of tragic mishap: “If you torture the data long enough, it will confess,” as stated by University of Chicago economics professor Ronald Coase. There is a general sense that the math can go wrong and that analytics is as much an art as a science.

As John Elder of Elder Research put it, “It’s always possible to get lucky (or unlucky). When you mine data and find something, is it real, or chance?” How can we confidently trust what a computer claims to have learned? How do we avert the dire declension, “Lies, damned lies, and statistics”?


There is a simple, elegant solution from Elder—but first, let me further magnify your fear: Even the very simplest predictive model risks utter failure. Mistaken, misleading conclusions are in fact terribly easy to come by.

A conclusion drawn about one single variable—even without the use of a common multivariate model (such as log-linear regression)—can go awry. In fact, one of the more famous such analytical insights, “an orange used car is least likely to be a lemon,” has recently been debunked by Elder and his colleague Ben Bullard at Elder Research, Inc.

Big data, with all its pomp and circumstance, can actually mean big risk. More data presents more opportunities to inadvertently discover untrue patterns that appear misleadingly strong within your dataset but do not, in fact, hold true in general. To be more specific, “bigger” data could mean longer data (a longer list of examples, which generally helps avert spurious conclusions), but it could also mean wider data (more columns, i.e., more variables or factors per example). With wider data, even if you are only considering one variable at a time, such as the color of each car, you are more likely to come across one that just happens to look predictive in your data by sheer chance alone. This peril that arises when searching across many variables has been dubbed vast search by John Elder.

Dr. Elder puts it this way: “Modern predictive analytic algorithms are hypothesis-generating machines, capable of testing millions of ‘ideas.’ The best result stumbled upon in its vast search has a much greater chance of being spurious… The problem is so widespread that it is the chief reason for a crisis in experimental science, where most journal results have been discovered to resist replication; that is, to be wrong!”
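Elder’s point is easy to verify for yourself. The sketch below is an illustration, not Elder’s code: the row and feature counts and the pooled z-test are arbitrary choices. It tests 500 purely random binary features against a coin-flip outcome; nothing is genuinely predictive, yet dozens of features clear the conventional 0.05 bar, and the best looks wildly significant.

```python
import math
import random

random.seed(0)

n_rows, n_features = 1000, 500

# Outcome: a fair coin -- no feature can genuinely predict it.
y = [random.random() < 0.5 for _ in range(n_rows)]

def one_sided_p(x1, n1, x2, n2):
    """One-sided pooled z-test for a difference in two proportions."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))  # tail beyond |z|

p_values = []
for _ in range(n_features):
    f = [random.random() < 0.5 for _ in range(n_rows)]  # pure noise feature
    n1 = sum(f)
    x1 = sum(1 for fi, yi in zip(f, y) if fi and yi)
    x2 = sum(y) - x1
    p_values.append(one_sided_p(x1, n1, x2, n_rows - n1))

significant = sum(p < 0.05 for p in p_values)
print(f"best p-value: {min(p_values):.5f}; "
      f"'significant' at 0.05: {significant}/{n_features}")
```

The best p-value reported here is exactly the “best result stumbled upon in its vast search” that Elder warns about: judged in isolation it looks like a discovery, but it is the guaranteed byproduct of testing hundreds of hypotheses.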

A few years ago, Berkeley Professor David Leinweber made waves with his discovery that the annual closing price of the S&P 500 stock market index could have been predicted from 1983 to 1993 by the rate of butter production in Bangladesh. Bangladesh’s butter production mathematically explains 75 percent of the index’s variation over that time. Urgent calls were placed to the Credibility Police, since it certainly cannot be believed that Bangladesh’s butter is closely tied to the U.S. stock market. If its butter production boomed or went bust in any given year, how could it be reasonable to assume that U.S. stocks would follow suit? This stirred up the greatest fears of predictive analytics skeptics, and vindicated nonbelievers. Eyebrows were raised so vigorously, they catapulted Professor Leinweber onto national television.

Crackpot or legitimate educator? It turns out Leinweber had contrived this analysis as a playful publicity stunt, within a chapter entitled “Stupid Data Miner Tricks” in his book Nerds on Wall Street. His analysis was designed to highlight a common misstep by exaggerating it. It’s dangerously easy to find ridiculous correlations, especially when you’re “predicting” only 11 data points (annual index closings for 1983 to 1993). By searching through a large number of financial indicators across many countries, something or other will show similar trends, just by chance. It will eventually unearth cockamamie relationships. For example, shiver me timbers, a related study showed buried treasure discoveries in England and Wales predicted the Dow Jones Industrial Average a full year ahead from 1992 to 2002.
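The arithmetic behind the stunt is easy to replicate. In this toy sketch (random series stand in for the butter figures; nothing here is Leinweber’s actual data), a search over 1,000 random “indicators” for the one that best fits 11 target values routinely finds a candidate explaining most of the variance by chance alone.

```python
import random

random.seed(0)

def r_squared(xs, ys):
    """R^2 of the best-fit line of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# 11 "annual index closings" -- here just random numbers
index = [random.gauss(0, 1) for _ in range(11)]

# Search 1,000 random candidate "indicators" for the best fit
best = max(
    r_squared([random.gauss(0, 1) for _ in range(11)], index)
    for _ in range(1000)
)
print(f"best R^2 found by search: {best:.2f}")
```

With only 11 data points, each random candidate has a real chance of correlating strongly, so the winner of a 1,000-way search almost always “explains” the index impressively, exactly as butter production did.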

Leinweber attracted the attention he sought, but his lesson didn’t seem to sink in. “I got calls for years asking me what the current butter business in Bangladesh was looking like and I kept saying, ‘Ya know, it was a joke, it was a joke!’ It’s scary how few people actually get that.” As Black Swan author Nassim Taleb put it in his suitably titled book, Fooled by Randomness, “Nowhere is the problem of induction more relevant than in the world of trading—and nowhere has it been as ignored!” Thus the occasional overzealous yet earnest public claim of economic prediction based on factors like women’s hemlines, men’s necktie width, Super Bowl results, and Christmas day snowfall in Boston.

The culprit that kills machine learning is overlearning (aka overfitting). Overlearning is the pitfall of mistaking noise for information, of assuming too much about what the data has shown. You’ve overlearned if you’ve read too much into the numbers and been led astray from the underlying truth.
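Overlearning is easy to demonstrate directly. In this illustrative sketch (not from the article), the true relationship is simply y = x, observed with a little noise; a degree-9 polynomial is flexible enough to memorize all 10 training points exactly, and pays for it at the in-between points, where it chases the noise rather than the line.

```python
import random

random.seed(1)

# Ground truth: y = x. Training data adds a little noise.
train_x = [i / 9 for i in range(10)]
train_y = [x + random.gauss(0, 0.1) for x in train_x]

def lagrange(xs, ys, x):
    """Evaluate the unique degree n-1 polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# The interpolant memorizes the training points exactly...
train_err = max(abs(lagrange(train_x, train_y, x) - y)
                for x, y in zip(train_x, train_y))

# ...but between them it reproduces the noise, not the truth y = x.
test_x = [(i + 0.5) / 9 for i in range(9)]
test_err = max(abs(lagrange(train_x, train_y, x) - x) for x in test_x)

print(f"train error: {train_err:.2e}, test error: {test_err:.2e}")
```

The training error is essentially zero while the error on fresh points is orders of magnitude larger: the model has read too much into the numbers.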

While many analytics practitioners consider overlearning a risk only with predictive models that combine multiple variables, the truth is that even well-publicized single-variable results are at risk. A dire need for a new paradigm has emerged.

But is it really that hard? Why would analysts now assert that standard tests of statistical significance break down when vast search is in play?

And what can be done to validate (i.e., test for significance) even after vast search has claimed to have made a discovery?

Now that your interest has been piqued, you may get the answers from one or both of the following in-depth sources:

  1. PLENARY CONFERENCE SESSION. Presentation at six (6) Predictive Analytics World events in 2014: PAW San Francisco (March), PAW Toronto (May), PAW Chicago (June), PAW Government (September in DC), PAW Boston (October), and PAW London (October): “The Peril of Vast Search (and How Target Shuffling Can Save Science)” by John Elder, CEO & Founder, Elder Research, Inc. Full session description
  2. TECHNICAL PAPER: “Are Orange Cars Really not Lemons?” by Ben Bullard & John Elder, Elder Research, Inc. This technical paper explores the difficulty introduced above, walking the reader through a detailed example and introducing a solution for addressing the challenge at hand: target shuffling. Partial excerpt of the paper:

A recent article in The Seattle Times reported that “an orange used car is least likely to be a lemon.” This discovery surfaced in a competition hosted by Kaggle to predict bad buys among used cars using a labeled dataset. Of the 72,983 used cars, 8,976 were bad buys (12.3%). Yet, of the 415 orange cars in the dataset, only 34 were bad (8.2%)…

But how unusual is this low proportion? That is, assuming the true proportion is really equal, what is the likelihood that it could have occurred by chance for a random partition of that size? Such a calculation takes into account the numbers of cars making up both proportions (good and bad Orange vs. good and bad non-Orange). When we apply a 1-sided statistical hypothesis test for equality of proportions between two samples it yields a p-value of 0.00675. In other words, the hypothesis test reveals that if the underlying reality is that the proportion of bad buys among orange cars is really equal to the proportion of bad buys among all non-orange cars, then the probability that one would observe a sample proportion for orange cars that is so much lower than the sample proportion for non-orange cars (given sample sizes of 415 and 72,466, respectively) is only 0.675%.
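The excerpt’s arithmetic can be checked in a few lines. A minimal sketch using the counts given above (note: the non-orange total derived here, 72,568, differs slightly from the paper’s 72,466, and the exact p-value depends on the test variant, so this pooled z-test lands near, not exactly on, the quoted 0.675%):

```python
import math

# Counts quoted from the paper's excerpt
total_cars, total_bad = 72983, 8976
orange, orange_bad = 415, 34
other = total_cars - orange
other_bad = total_bad - orange_bad

p_orange = orange_bad / orange      # bad-buy rate among orange cars
p_other = other_bad / other         # bad-buy rate among all other cars
p_pool = total_bad / total_cars
se = math.sqrt(p_pool * (1 - p_pool) * (1 / orange + 1 / other))
z = (p_orange - p_other) / se
p_value = 0.5 * math.erfc(abs(z) / math.sqrt(2))  # one-sided

print(f"orange: {p_orange:.1%} bad, other: {p_other:.1%} bad, "
      f"one-sided p = {p_value:.4f}")
```

Taken at face value, a p-value below 1% looks convincing. The paper’s point, developed next, is why that face value cannot be trusted once you account for how the hypothesis was found.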

[…]

[But] what we see is that statistical hypothesis tests only work when the hypothesis comes first, and the analysis second. One cannot use the data to inform the hypothesis and then test that hypothesis on the same data. That leads to overfit and over-confidence in your results, which leads to the model underperforming (or failing entirely) on new data, where it is most needed.

And yet, how do we know what to hypothesize? Isn’t the great strength of data mining that the computer can try out all sorts of things and report back which ones might work?
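Target shuffling, the remedy the paper introduces, answers that question empirically: let the computer keep trying everything, but calibrate the winning result against the same search run on shuffled targets. The following is a schematic sketch with synthetic data (the dataset, the best-color search, and the statistic are illustrative choices, not the paper’s procedure): shuffle the outcome labels to destroy any real color effect, rerun the entire search on each shuffle, and report how often chance alone ties the original “discovery.”

```python
import random

random.seed(0)

n_cars, n_colors = 5000, 16
colors = [random.randrange(n_colors) for _ in range(n_cars)]
bad = [random.random() < 0.12 for _ in range(n_cars)]  # color has no real effect

def best_gap(colors, bad):
    """Largest amount by which any color's bad rate falls below the overall rate."""
    total = [0] * n_colors
    bad_ct = [0] * n_colors
    for c, b in zip(colors, bad):
        total[c] += 1
        bad_ct[c] += b
    overall = sum(bad) / len(bad)
    return max(overall - bad_ct[c] / total[c] for c in range(n_colors) if total[c])

observed = best_gap(colors, bad)  # the "discovery" a naive search would report

# Target shuffling: break any color/outcome link, rerun the SAME search,
# and ask how often chance alone does as well as the observed result.
shuffled = bad[:]
wins = 0
for _ in range(100):
    random.shuffle(shuffled)
    wins += best_gap(colors, shuffled) >= observed
adjusted_p = wins / 100

print(f"best observed gap: {observed:.3f}, shuffle-adjusted p = {adjusted_p:.2f}")
```

Because each shuffle repeats the full search, the adjusted p-value automatically prices in vast search: it asks how unusual the best result is, not how unusual one pre-specified result would be.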

Access the full technical paper by Bullard and Elder

 

© 2008-25 SmartData Collective. All Rights Reserved.