Gregory Piatetsky-Shapiro, PhD (@KDNuggets) is the president of KDnuggets, which provides consulting in the areas of business analytics, data mining, data science, and knowledge discovery. Previously, he led data mining and consulting groups at GTE Laboratories, Knowledge Stream Partners, and Xchange. He has extensive experience developing CRM, customer attrition, cross-sell, segmentation and other models for some of the leading banks, insurance companies, and telecommunications companies.
He is also the editor and publisher of KDnuggets News, the leading newsletter on data mining and knowledge discovery (published since 1993), and the KDnuggets.com website (published since 1997), a top resource for data mining and analytics news, software, jobs, courses, data, education, and more. Gregory has over 60 publications, with over 10,000 citations, including two best-selling books and several edited collections on topics related to data mining and knowledge discovery.
Q − You recently said on Twitter that detective Sherlock Holmes would have been “a good data scientist.” What skills do you think he possessed that anyone can use to discover the unknowns in data?
A − I read most of Arthur Conan Doyle's books about Sherlock Holmes as a kid, and loved his powers of deduction. Holmes was a keen observer of facts and had sound logic. He was also on the cutting edge of the science of his time; today's Sherlock Holmes would be analyzing the suspects' social networks (and perhaps hacking them) in addition to looking for fingerprints. Finally, he had great intuition and knew how to reject wrong hypotheses: no matter how appealing they initially seemed, if they were not supported by the facts, he rejected them.
Q − Holmes was known for seeing “hidden” information and identifying opportunities and threats. In the most recent film, the viewer was given a glimpse of his method of calculating the next move before it happened in sort of a flash-forward scene. Do you think today’s predictive analytics technologies and data visualizations give us this “extra sense” that Holmes had? What does this do for companies in competitive situations?
A − Of course, analytics is much easier when we predict the behavior of inanimate objects – asteroids, viruses, etc. Predicting human behavior is much more difficult. Predictive analytics today gives us the ability to predict future behavior, but for non-trivial tasks the predictions are only accurate in aggregate, not for individuals.
For example, say Verizon’s monthly customer churn rate is 2%. That means that every month, 2 out of every 100 customers switch providers. With analytics, we can select a group of customers who are 5-7 times more likely to churn: the expected churn rate in the analytics-selected list can be 10%-14%, versus 2% in a random list. So of 100 selected customers, we expect only 10-14 to churn.
This has enormous business value because these customers can be contacted much more efficiently and personalized offers can be made to retain them. However, this example shows the limits of analytics in predicting human behavior “in aggregate” – it is far from perfect. But it does NOT need to be perfect to be useful.
Analytics can be used when there are many thousands of similar customers. When we have a single person, statistical methods are not relevant and need to be augmented with knowledge and rules. Ultimately, a combination of artificial intelligence and statistical methods can be a powerful predictor.
Q − You also said recently on Twitter that you think the “data scientist” title is provocative but misleading. You suggested data engineer as a description. Can you distinguish the roles for us and give us a “for instance” of when a company would involve a data engineer or scientist?
A − I am not saying “data scientist” is provocative – I think it is a great title for marketing and branding. But the people hired in industry under this title do very applied work – a combination of code tuning, business insight, and knowing how to extract business value from data. A more accurate (if less sexy) description is “data engineer” (or perhaps “data wizard” or “data wrangler”).
A − People who do actual science and publish in top journals or conferences almost never use the “data scientist” title. They more frequently use “knowledge discovery” or “data mining” in their position or lab description. For example, in our listings of academic and research positions, there are over a dozen positions with “data mining” in the title and only one with “data scientist.”
Q − In your Analytics and Data Mining Industry Overview presentation, you said that big data is a second industrial revolution and that it will help us do old things better, like churn prediction, customer modeling, recommendations, fraud detection, and security/intelligence. Where do you see these “old things” changing the most (e.g. industries or markets)? What lessons do we still have to learn?
A − We already see huge changes in advertising and behavioral targeting, and most consumer-oriented businesses (retail, telecom, etc.) are actively leveraging big data to get more value from their customers. However, sometimes the increased accuracy may backfire, as in the recent story of the Target retail chain using changes in buying behavior to learn that a teen was pregnant before her father did. I don’t know exactly where the line that Target crossed lies, but as a society we will have to adjust to changing expectations of privacy in the era of big data.
Q − You have a storied background in “knowledge discovery” – from developing models to predict child support non-payment and customer attrition, to predicting drug effectiveness, to developing methods for identifying online auction fraud. How can companies capitalize on knowledge discovery? How do you feel the term knowledge discovery applies to “predictive analytics” today?
A − I coined the term “knowledge discovery” to emphasize that what we want to find in data is understandable knowledge, not just incomprehensible patterns. Much of big data analysis, and many of the best predictive algorithms, unfortunately produce results that are incomprehensible. So we have a tension between accuracy and understandability. I think that better understanding of predictive models will contribute to increased trust in such models, and may also help resolve some privacy issues by providing increased transparency.
Spotfire Blogging Team