The “decline effect,” random variation, and evidence-based marketing

There’s an interesting article by Jonah Lehrer in the Dec. 13 issue of The New Yorker, “The Truth Wears Off:  Is there something wrong with the scientific method?”  Lehrer reports that a growing number of scientists are concerned about what psychologist Joseph Banks Rhine termed the “decline effect.”  In a nutshell, the “decline effect” is the tendency for the size of an observed effect to shrink over the course of studies attempting to replicate it.

Lehrer cites examples from studies of the clinical outcomes for a class of once-promising antipsychotic drugs as well as from more theoretical research.  This is a scary situation given the inferential nature of most scientific research.  Each set of observations represents an opportunity to disconfirm a hypothesis.  As long as subsequent observations don’t lead to disconfirmation, our confidence in the hypothesis grows.  The decline effect suggests that replication is more likely, over time, to disconfirm a hypothesis than not.  Under those circumstances, it’s hard to develop sound theory.

Given that market researchers apply much of the same reasoning as scientists in deciding what’s an effect and what isn’t, the decline effect is a serious threat to creating customer knowledge and making evidence-based marketing decisions.

Lehrer suggests that the decline effect is a consequence of publication bias.  Only studies that demonstrate some noticeable effect (usually consistent with a prior hypothesis) make the cut for publication.  Our beliefs about randomness and the likelihood of large effects have lulled us into a false confidence about the validity or reliability of study results.  A researcher observes a large effect, that study gets published and, in many cases, that is the end of it.  The researcher moves on to another topic.  Unless there are multiple attempts at replication, we cannot be certain (even with statistical “significance” testing) that the big effect was not a random outlier.
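The publication-bias argument can be illustrated with a small simulation.  This is only a sketch under assumptions of my own choosing, not anything taken from Lehrer’s article: a modest true effect of 0.2 standard deviations, 30 subjects per group, and a rule that a study is “published” only if its z-score exceeds 1.96.  Every published study is then replicated once with a fresh sample.

```python
import random
import statistics


def simulate_study(true_effect, n, rng):
    """Run one two-group study; return (observed difference, z-score)."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = (statistics.variance(treated) / n + statistics.variance(control) / n) ** 0.5
    return diff, diff / se


def decline_effect_demo(true_effect=0.2, n=30, n_studies=2000, seed=1):
    """Compare published effect sizes with the effects seen on replication.

    All parameter values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    published, replications = [], []
    for _ in range(n_studies):
        diff, z = simulate_study(true_effect, n, rng)
        if z > 1.96:  # assumed filter: only "significant" results are published
            published.append(diff)
            # Replicate each published study once with a new random sample.
            rep_diff, _ = simulate_study(true_effect, n, rng)
            replications.append(rep_diff)
    return statistics.fmean(published), statistics.fmean(replications), len(published)


if __name__ == "__main__":
    pub, rep, k = decline_effect_demo()
    print(f"{k} of 2000 studies published")
    print(f"mean published effect:   {pub:.2f}")
    print(f"mean replication effect: {rep:.2f}")
```

Under these assumptions the published studies overstate the true effect, because only the lucky draws pass the filter, while the replications drift back toward it.  Nothing about the underlying effect has changed; the apparent decline comes from the selection step alone.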

Before you dismiss this idea, consider two empirical examples.  First is Lehrer’s account of an experiment conducted in the late nineteen-nineties by John Crabbe, a neuroscientist at the Oregon Health and Science University.  Crabbe ran the same study of mouse behavior in three geographically dispersed labs, making sure that all other aspects of the experiment were “identical” across the three labs down to the smallest detail, such as the day on which the mice were shipped to the labs (all from the same supplier and of the same genetic strain, of course).  The study measured the effect of cocaine on mouse movement.  In one lab, the mice moved, on average, 600 centimeters more than their baseline after being injected with cocaine.  In the second lab they moved an average of 701 additional centimeters.  In the third lab they moved an additional 5,000 centimeters!  Three experiments, as identical as possible, yielded effects that varied by more than eightfold.

My second example is closer to the core of market research.  With access to a sample of 2,000 online survey respondents, I conducted a quasi-bootstrap analysis to determine just how much sampling variation I might see with repeated random samples drawn from this larger “population.”  I created 20 separate random samples with 500 respondents in each sample and looked at the “sampling” distribution of a few key measures from the survey.  It’s a simple matter to calculate a 95% confidence interval for each of the measures.  Over the long run (many thousands of replications) we would expect about 5% of the sample estimates to fall outside the 95% confidence interval for any metric.  With only 20 samples, it’s reasonable to expect that 10% or 15% (2 or 3 of the samples) would fall outside that range.  Across the various measures, I found that 30%, 45%, 25%, and 50% of the estimates fell outside that 95% confidence interval.  (By the way, I repeated this exercise with a different sample of online survey respondents and found 20% to 40% of the values lying outside the 95% confidence intervals.)
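To put those percentages in context, here is a rough benchmark for how many of 20 estimates we should expect to miss.  It assumes each estimate independently has a 5% chance of landing outside the interval, so the number of misses is binomial with n = 20 and p = 0.05.  That independence is an idealization: my subsamples were drawn from the same 2,000 respondents and overlap, so treat the numbers as a yardstick rather than an exact test.  The function name is just for illustration.

```python
from math import comb


def prob_at_least(k, n=20, p=0.05):
    """P(X >= k) for X ~ Binomial(n, p): chance that k or more of n estimates miss."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # 2 and 3 misses are the "reasonable" counts; 5, 6, 9 and 10 correspond to
    # the 25%, 30%, 45% and 50% miss rates reported in the text.
    for misses in (2, 3, 5, 6, 9, 10):
        print(f"P(at least {misses:2d} of 20 outside) = {prob_at_least(misses):.2e}")
```

Under this idealized model, two or more misses is unremarkable (roughly one chance in four), while five or more would be rare (well under 1%).  Miss rates of 25% to 50% are therefore hard to attribute to bad luck with only 20 samples.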

Many “evidence-based” marketing decisions are informed by facts and insights obtained from one-off samples of (at best) a few hundred “representative” consumers.  We calculate confidence intervals for point estimates or apply “appropriate” tests, such as Student’s t, for comparing subgroups on some measure of interest, and as long as the calculations inform us that a confidence interval is sufficiently narrow or the difference in the means of two groups is sufficiently large, we accept the result without much question.  Logically, these tests tell us something about the sample but may be wildly misleading with respect to inferences about a larger population of interest.

Online samples, whether from “opt-in” panels or the wide wide river of the Internet, have raised (properly) several concerns about the quality of the evidence on which we hope to base marketing decisions.  As problems of “respondent quality” have been addressed by digital fingerprinting and other verification solutions, attention has turned to measurement consistency between samples (The Grand Mean Project is one example).  After all, if you cannot get the same answer when you measure the same thing in the same way on different occasions, how much faith can you have in online survey research?

A focus on consistency is important but needs to be combined with a better understanding of how much consistency we should expect in the first place.  Bootstrapping analysis is a good place to start.  And, once we have a better understanding of true sampling error, we can develop decision strategies that better reflect and incorporate our uncertainty about the “evidence.”
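For readers who have not tried it, a basic percentile bootstrap takes only a few lines.  The sketch below is a generic illustration, not the exact procedure I used above: the data are synthetic (500 simulated yes/no answers with an assumed 40% “yes” rate), and the function names and defaults are my own.

```python
import random
import statistics


def make_synthetic_sample(n=500, p_yes=0.4, seed=42):
    """Simulated yes/no survey answers (1 = yes); a stand-in for real data."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_yes else 0 for _ in range(n)]


def bootstrap_ci(values, stat=statistics.fmean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample respondents with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    sample = make_synthetic_sample()
    lower, upper = bootstrap_ci(sample)
    print(f"sample proportion: {statistics.fmean(sample):.3f}")
    print(f"95% bootstrap interval: {lower:.3f} to {upper:.3f}")
```

With well-behaved simulated data like this, the bootstrap interval should come out close to the textbook interval.  The interesting cases are real survey samples, where comparing the two is one way to see whether the textbook formula understates the variation.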

Copyright 2011 by David B. Bakken.  All rights reserved.