The First Data Scientist on the Evolution of Data Science
Norman Nie was not surprised by the accurate predictions of the presidential election results from Nate Silver and others. “A lot of it,” he told me recently, “is good statistics and good science and good statistical programming packages.” The increasing amount of money spent by the media on polling, Nie believes, improved the accuracy of predictions by increasing the number of observations. In addition, with so much knowledge available now about every individual, the hypotheses and models used by the forecasters were developed on solid foundations. Says Nie: “40 years after The Changing American Voter, we really understand the voter’s decision.”
Nie is referring to the seminal work he (together with Sidney Verba and John Petrocik) published in 1976, itself a response to the landmark 1960 study The American Voter. While the latter described a passive electorate, unconcerned with political issues, and guided in its voting decisions primarily by party allegiance, Nie and his co-authors showed that contemporary voters had higher political awareness and were guided by their own position on the issues rather than political parties. Tracing the changes to the emergence of divisive issues and new voters, they outlined a new political landscape of issue-based factions and a more sophisticated electorate.
The two studies differed also in the tools they used to reach their conclusions. Working on his Ph.D. in political science at Stanford University in the mid-1960s, Nie became frustrated with the computer-based statistical analysis tools available at the time. They could not handle the amount of data that he wanted to analyze for his dissertation, which was part of a comparative study based on surveys in seven nations. In Participation and Political Equality (1978), the book summarizing the results of the study, Nie and his co-authors provided us with a glimpse at the big data challenges of the time: “If surfers travel the world to find that perfect wave, and mountain climbers do the same to climb the unclimbable, cross-national survey researchers, burdened with the immense data files, travel anywhere to find the cheaper computer.”
But it was not only the cost and limited capacity of the mainframe he was working with that frustrated Nie. The statistical analysis tools of the time were difficult to use, and the leading tool, BMD, was originally developed for bio-medical researchers, not for social scientists.
So Nie became the first “data scientist,” I would argue. The term has been used recently to describe the new generation of data massagers and miners, emerging first at Web-based companies that were, to borrow a phrase, “burdened with immense data files.” Nie and the solution to his frustration embodied the 1960s version of not only the challenge created by the growing volumes of data but all the other themes prevalent in today’s discussions of data science: The combination of software engineering, software design, and statistical skills, frequently achieved through teamwork; the importance of domain knowledge; the creativity and drive leading to new products and services for data analysis; and the scientific, empirical bent characterizing the work of finding new insights in data.
Once he decided to alleviate his frustration by developing a new data analysis tool, Nie demonstrated the “soft skills” required of a data scientist. He convinced two other people with complementary skills to work with him: Dale Bent, who was getting his Ph.D. in Operations Research at Stanford, and Tex Hull, a top-notch programmer. Together, they invented the Statistical Package for the Social Sciences, or SPSS.
As Nie sees it, his domain knowledge was instrumental in SPSS’ initial success. He was a social scientist first and foremost, developing a tool for other social scientists: one with an easy-to-use interface, requiring little knowledge of programming, and including the most popular statistical analysis procedures. Designing SPSS from the point of view of “a social scientist looking at observations about people,” says Nie, also helped with the popularity of the software among data analysts outside of academia, where it initially spread by word of mouth. First, insurance companies started using it for mortality or risk analysis. They were joined by many other companies, from a variety of industries, all with the need to analyze data they had collected about the behavior and attitudes of their customers or any other relevant constituency. Of great help was the thick manual that came with the software, which established a new standard for software documentation. Again, domain knowledge was important: Nie wanted to have the manual used as a tool for teaching social science research methods, as he did at the University of Chicago, where he was a professor of political science from 1968 to 1998.
While domain knowledge may have been instrumental in launching SPSS, the same statistics and data analysis methods apply across many knowledge domains. Indeed, the two most successful statistical analysis programs, SPSS and SAS, expanded from their domain-specific base (social science for SPSS and bio-medical research for SAS) to become tools used wherever there was data to analyze. “One of the interesting things about statistics is that the techniques you use tend to be very horizontal in terms of the application,” says Nie, and can be used in a wide variety of fields.
The common denominator for all the various applications and domains where SPSS and other statistical packages were applied, according to Nie, was that “hard data drives model building and model testing.” Empiricism became cool in the 1960s and 1970s and helped drive the widespread use of computer-based statistical tools. The growing complexity of the social and physical world gave rise to many new challenges and there was a growing realization, according to Nie, “that the best way to understand all of these problems is with empirical models.”
“Empirical model-building” is also how “data scientists approach the world” today, says Nie. But big data means that a lot has changed in the intervening years. Specifically, Nie argues, with more data and better tools—both more powerful computers and statistical analysis programs such as R—we have more sophisticated models. The limitation of the technologies of the past forced the use of limited-size samples and approximation methods. Today, says Nie, “we can move beyond linear approximation models” and achieve greater precision and accuracy in forecasts.
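The link between sample size and precision, which underlies both the limited-size samples of the past and Nie's earlier point about more polling producing better predictions, can be illustrated with a short sketch. It is not drawn from Nie's own work or from any forecaster's model. It uses the standard textbook margin of error for a proportion and assumes a simple random sample and a 95% confidence level; the function name `margin_of_error` and the sample sizes are hypothetical choices made for illustration.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of n respondents; z = 1.96
    corresponds to a 95% confidence level under the normal approximation.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of respondents")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Illustrative sample sizes only; p = 0.5 is the worst case for a proportion.
    for n in (500, 1000, 4000, 16000):
        moe = margin_of_error(0.5, n)
        print(f"n={n:>6}: +/- {moe * 100:.1f} percentage points")
```

Because the margin shrinks with the square root of the sample size, quadrupling the number of respondents only halves the uncertainty, which is one reason larger data volumes matter for forecast precision.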
This new stage in the evolution of data science holds a lot of promise, but also requires people who can take advantage of the new technologies and techniques for data analysis. Nie cautions about what he says is “the part that’s a little scary,” the challenge of imparting domain knowledge to people trained in data science today. The early “data scientists” were people like him, with deep knowledge of a specific domain and the way questions were asked and answered in it, and he thinks it is difficult to capture that knowledge and incorporate it in a data scientist’s education.
The bigger challenge is the education of just about everybody else: “For several generations after World War II, we have told people that they can opt out very early from basic math training.” This has led to a crisis and the failure of the educational system to prepare students for today’s and tomorrow’s jobs. Nie’s advice? “To any entering undergraduate who says ‘What do I do in the American educational system to make sure I have a job when I get out?’ I would say take math and statistics and you’re guaranteed a job.”
Beyond providing more people with the opportunity of getting a good job, an investment in basic data science education can lead to a more informed citizenry, making smarter decisions in a political environment characterized by increasing complexity and rising polarization. “Today we’ve become this incredibly polarized society that can’t agree on anything,” says Nie. “Any new issue that comes up is painted either red or blue. We went from high agreement, low partisanship, end-of-ideology at the end of World War II to this absolute bitter struggle between red and blue that we’re in now.”
Ever the inquisitive data scientist, Nie is analyzing survey data to figure out how it all came about. He hopes to publish the results of his research in a forthcoming book, tentatively titled The Ever-Changing American Voter.
Gil Press is Managing Partner of gPress, a marketing, publishing, and research consultancy. Previously, he held senior marketing and research management positions at NORC, DEC and EMC. Most recently, he was Senior Director, Thought Leadership Marketing at EMC, where he launched the Big Data conversation with the “How Much Information?” study (2000 and 2003 with UC Berkeley) and the ...