Poor Quality Data Sucks

Contents

Baseball and Data Quality

Baseball and Data Quality
The Drunkard’s Walk
Conclusion

Books Recommended by Red Sox Nation

Over the last few months on his Information Management blog, Steve Miller has been writing posts inspired by a great 2008 book that we both highly recommend: The Drunkard’s Walk: How Randomness Rules Our Lives by Leonard Mlodinow.

In his most recent post The Demise of the 2009 Boston Red Sox: Super-Crunching Takes a Drunkard’s Walk, Miller takes on my beloved Boston Red Sox and the less-than-glorious conclusion to their 2009 season.

For those readers who are not baseball fans, the Los Angeles Angels of Anaheim swept the Red Sox out of the playoffs. I will let Miller’s words describe their demise: “Down two to none in the best of five series, the Red Sox took a 6-4 lead into the ninth inning, turning control over to impenetrable closer Jonathan Papelbon, who hadn’t allowed a run in 26 postseason innings. The Angels, within one strike of defeat on three occasions, somehow managed a miracle rally, scoring 3 runs to take the lead 7-6, then holding off the Red Sox in the bottom of the ninth for the victory to complete the shocking sweep.”

Baseball and Data Quality

What, you may be asking, does baseball have to do with data quality? Beyond simply being two of my all-time …

Baseball and Data Quality

What, you may be asking, does baseball have to do with data quality? Beyond simply being two of my all-time favorite topics, quite a lot actually. Baseball data is mostly transaction data describing the statistical events of games played.

Statistical analysis has been a beloved pastime even longer than baseball has been America’s Pastime. Number-crunching is far more than just a quantitative exercise in counting. The qualitative component of statistics – discerning what the numbers mean, analyzing them to discover predictive patterns and trends – is the very basis of data-driven decision making.

“The Red Sox,” as Miller explained, “are certainly exemplars of the data and analytic team-building methodology” chronicled in Moneyball: The Art of Winning an Unfair Game, the 2003 book by Michael Lewis. Red Sox General Manager Theo Epstein has always been an advocate of the so-called evidenced-based baseball, or baseball analytics, pioneered by Bill James, the baseball writer, historian, statistician, current Red Sox consultant, and founder of Sabermetrics.

In another book that Miller and I both highly recommend, Super Crunchers, author Ian Ayres explained that “Bill James challenged the notion that baseball experts could judge talent simply by watching a player. James’s simple but powerful thesis was that data-based analysis in baseball was superior to observational expertise. James’s number-crunching approach was particular anathema to scouts.”

“James was baseball’s herald,” continues Ayres, “of data-driven decision making.”

The Drunkard’s Walk

As Mlodinow explains in the prologue: “The title The Drunkard’s Walk comes from a mathematical term describing random motion, such as the paths molecules follow as they fly through space, incessantly bumping, and being bumped by, their sister molecules. The surprise is that the tools used to understand the drunkard’s walk can also be employed to help understand the events of everyday life.”

Later in the book, Mlodinow describes the hidden effects of randomness by discussing how to build a mathematical model for the probability that a baseball player will hit a home run: “The result of any particular at bat depends on the player’s ability, of course. But it also depends on the interplay of many other factors: his health, the wind, the sun or the stadium lights, the quality of the pitches he receives, the game situation, whether he correctly guesses how the pitcher will throw, whether his hand-eye coordination works just perfectly as he takes his swing, whether that brunette he met at the bar kept him up too late, or the chili-cheese dog with garlic fries he had for breakfast soured his stomach.”

“If not for all the unpredictable factors,” continues Mlodinow, “a player would either hit a home run on every at bat or fail to do so. Instead, for each at bat all you can say is that he has a certain probability of hitting a home run and a certain probability of failing to hit one. Over the hundreds of at bats he has each year, those random factors usually average out and result in some typical home run production that increases as the player becomes more skillful and then eventually decreases owing to the same process that etches wrinkles in his handsome face. But sometimes the random factors don’t average out. How often does that happen, and how large is the aberration?”

Conclusion

I have heard some (not Mlodinow or anyone else mentioned in this post) argue that data quality is an irrelevant issue. The basis of their argument is that poor quality data are simply random factors that, in any data set of statistically significant size, will usually average out and therefore have a negligible effect on any data-based decisions.

However, the random factors don’t always average out. It is important to not only measure exactly how often poor quality data occur, but acknowledge the large aberration poor quality data are, especially in data-driven decision making.

As every citizen of Red Sox Nation is taught from birth, the only acceptable opinion of our American League East Division rivals, the New York Yankees, is encapsulated in the chant heard throughout the baseball season (and not just at Fenway Park):

“Yankees Suck!”

From their inception, the day-to-day business decisions of every organization are based on its data. This decision-critical information drives the operational, tactical, and strategic initiatives essential to the enterprise’s mission to survive and thrive in today’s highly competitive and rapidly evolving marketplace.

It doesn’t quite roll off the tongue as easily, but a chant heard throughout these enterprise information initiatives is: