Big Data: Careful! When Correlations Go Crazy
In the world of big data, strange truths about the world begin to emerge. Orange cars are the most reliable used cars to buy. Prepaid phone card sales can predict unrest in Africa. And women with larger breasts spend more money online.
That last one comes from a recent study released by Alibaba, the Chinese website that hopes to be the next Amazon. Data analysts looking at data points for ladies’ underwear sales noticed that women who purchased larger bra sizes spent more online overall.
But is that knowledge useful? Maybe, maybe not.
Correlation does not equal causation.
If you ever took a science class in school, you might have heard the phrase, “Correlation does not equal causation.” It tells us that just because women who purchase larger-sized bras spend more money online, that doesn’t mean their larger bra size caused them to spend more money.
And that can be the problem when data analysts are looking at these strange and interesting new truths that emerge from the mass quantities of data to which we now have access. If we take it as true that orange used cars are more reliable, the question then becomes why: Are owners of orange cars more careful? Does the color prevent people from getting in accidents? Or does the color orange have some other magical property that keeps a car running well? The data has no answers.
Tyler Vigen posts funny charts to his website, Spurious Correlations, that show the danger of simply matching two data sets without any deeper understanding of how the things are related. For example, if correlation is all you have to go by, then we can assume that the more films Nicolas Cage appears in during a given year, the more swimming pool drownings will result, and that an increase in U.S. spending on science results in an increase in suicides by hanging. Spurious indeed, we hope, or U.S. researchers and Nick Cage’s film career are in trouble.
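To see how easily this kind of match-up happens, here is a minimal Python sketch. The numbers are invented for illustration and are not Vigen's data; the names `series_a` and `series_b` are placeholders for any two yearly counts that both happen to drift upward. The sketch computes the Pearson correlation of the raw series, then of their year-to-year changes, which is one simple way to take a shared trend out of the picture.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def first_differences(values):
    """Year-to-year changes: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Made-up yearly counts (assumption: illustrative only, not real data).
# Both series simply rise over the same ten years.
series_a = [2, 3, 3, 4, 5, 5, 6, 7, 7, 8]
series_b = [10, 12, 13, 14, 13, 15, 18, 19, 20, 22]

# Correlation of the raw levels: dominated by the shared upward trend.
r_levels = pearson(series_a, series_b)

# Correlation of the yearly changes: the shared trend is removed.
r_changes = pearson(first_differences(series_a), first_differences(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

With these invented numbers the raw series correlate strongly (above 0.9), while the year-to-year changes show essentially no relationship. Neither figure says anything about cause; the point is only that two things rising over the same period will look related whether or not they are.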
The data-driven crystal ball.
Now that we have all this data, we’re just on the cusp of figuring out how to use it to our advantage. The goal is to be able to use these strange truths to try to predict everything from buying habits to the spread of the flu virus, and the results are just as varied.
Researchers have found that Twitter updates can predict flu outbreaks more quickly and more accurately than traditional CDC tracking methods. In fact, Twitter data can predict an outbreak up to 8 days in advance with more than 90 percent accuracy.
The African telecom company Celtel discovered a similar predictive ability when it noticed an uptick in sales of prepaid phone cards before major incidents of violence and unrest in Congo. The cards were denominated in U.S. dollars, and people bought them as something portable and valuable to carry with them and as protection against local inflation.
Similarly, Alibaba hopes to use the incredible quantities of data it collects (as many as 14 million data points in a single day) to predict factors in a huge variety of businesses it may try.
“For example, if we have a lot of data on what people purchase in terms of food, groceries, is that data going to be helpful when we do healthcare? I think so,” an executive told online magazine Quartz.
As more companies try to use their data to predict consumer behavior, don’t be surprised to see more of these curious truths emerge. Facebook, of course, has an entire team dedicated to data science, and they frequently post their findings to their Facebook page, like the fact that if your name is Yvette, you are more than 37 percent more likely than the average person to have a sister named Yvonne.
How that helps Facebook’s business plan is yet to be seen.
As always, thank you very much for reading my posts. You might also be interested in my new book: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance
For more, please check out my other posts in The Big Data Guru column.