SmartData Collective

The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”

hstevens

Between 2007 and 2010, graduate students Erez Aiden, Jean-Baptiste Michel and Yuan Shen developed a tool for analyzing Google’s digitized books. Known as the N-gram viewer, their tool essentially counted the number of times a word or phrase appeared in all the digitized publications for a given year. By plotting the year-by-year count, the software produced an N-gram chart: a plot showing the frequency of word use over time. For example, the accompanying chart plots the frequency of the words “data,” “fact,” and “information” from 1800 to 2000.

This is a powerful tool, and there is a lot we can discern from even this basic query (which took all of 5 seconds): “fact” became more and more common through the 19th century, reaching a plateau by about 1920 and then declining sharply from about 1970 onward; “data” and “information” grew rapidly in popularity in the 20th century, tracking each other closely until about the mid-1980s.
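The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual N-gram viewer code: the function names (`tokenize`, `count_phrase`, `yearly_counts`) and the three-sentence toy corpus are hypothetical, and the sketch reports raw counts per year, with no normalization and no plotting.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed, naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def count_phrase(tokens, phrase_tokens):
    """Count occurrences of a word or multi-word phrase in a token list."""
    n = len(phrase_tokens)
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    )


def yearly_counts(corpus, terms):
    """Return {term: {year: count}} for an iterable of (year, text) pairs."""
    counts = {term: defaultdict(int) for term in terms}
    for year, text in corpus:
        tokens = tokenize(text)
        for term in terms:
            counts[term][year] += count_phrase(tokens, tokenize(term))
    # Sort by year so each series is ready to plot as a line over time.
    return {term: dict(sorted(by_year.items())) for term, by_year in counts.items()}


# Hypothetical toy corpus: (publication year, text) pairs.
toy_corpus = [
    (1850, "The fact is plain. Every fact matters."),
    (1850, "Information was scarce."),
    (1950, "Data and information; more data."),
]

series = yearly_counts(toy_corpus, ["fact", "data", "information"])
for term, by_year in series.items():
    print(term, by_year)
```

Each resulting series is one line on the chart: years on the horizontal axis, counts on the vertical. Note that the sketch sees only matching strings; it has no notion of who wrote the text or in what sense a word was used.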

Aiden and Michel have written about their N-gram viewer and how they came to develop it in a book: Uncharted: Big Data as a Lens on Human Culture (2013). The book gives many interesting examples of how N-grams might be used to track fame, understand nationalism, explore the birth and death of words and concepts, and follow the influence of inventions. Aiden and Michel see their tool as part of a big data revolution that will transform the way we understand human culture in general and history in particular. “Big data,” they argue, “is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower” (p. 8). They see the N-gram as an instrument, like the microscope, that will provide us with new quantitative insight into historical change. They call this approach “culturomics” (by analogy with genomics).

But there are some good reasons to be skeptical, or perhaps even a little worried, about such claims. There are two issues in particular that are suggestive of more general problems for big data approaches.

First, there is the issue of completeness. One of the vaunted features of big data approaches is their ability to utilize very wide data sets. Viktor Mayer-Schönberger and Kenneth Cukier have argued that one of the unique features of big data approaches is the use of “all available data.” Of course, Google Books has a very wide (and long) data set – 30 million or so books. For this reason it is easy to see something like the N-gram viewer as working with “all available data” or some close approximation of it. But even this huge amount of text is only 20-25% of all material ever published. This is a large fraction, and it is no doubt valuable for all sorts of purposes. But it raises questions about what is left out. Do we have a representative sample? Are some sorts of books being systematically excluded? How does copyright influence which books are available?

Even leaving aside the question of what is included and excluded, we have another problem. Google Books (at the moment at least) takes account only of published materials. Humans have, over the last centuries, created and left behind huge troves of unpublished material – letters, diaries, manuscripts, memos, files, etc. Paying attention only to published sources can lead to unbalanced accounts. Publication is expensive, and those who publish written work tend to be those in positions of power and influence (men not women, lords not peasants, owners not workers, colonizers not colonized, etc.). Historians have grappled with this problem for many years – writing balanced history often means finding a way to account for inherently unbalanced sources. In dealing with societies where only a small fraction of the population is literate, one must be even more careful not to take any written material (published or unpublished) as representative of the overall population.

Second, there is the issue of quantitation. Historians attempt to make up for the inadequacy of their sources by using their intuition, judgment, empathy, and imagination. This sort of work may become increasingly marginalized in a world where the humanities and social sciences are subjected to quantitation. Aiden and Michel imagine and celebrate a “scrambling of the boundaries between science and the humanities” (p. 207). Surely, both have a lot to learn from one another. But it is important that quantitation does not come to dominate history or the other humanities. Counting words is important, but it cannot tell us everything. The N-gram viewer tells us nothing, for example, about the context in which words appear, and context is exactly the kind of thing that historians are concerned with. Ultimately, the N-gram viewer is a one-dimensional measure of a highly multidimensional space (culture, society, history).

To be fair, Aiden and Michel are aware of many of the shortcomings of their tool, and their book contains some intelligent discussions of its possible limitations. But big data tools of this type – powerful and yet easy to use – are proving immensely popular. They are seductive. Here, and with big data in general, we should be careful not to confuse having very large amounts of data with having “all data,” and we should temper the drive to quantification with careful judgment and intuition.
