The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”

hstevens
Last updated: 2014/10/04 at 10:57 AM

Between 2007 and 2010, graduate students Erez Aiden, Jean-Baptiste Michel and Yuan Shen developed a tool for analyzing Google’s digitized books. Known as the N-gram viewer, their tool essentially just counted the number of times a word or phrase appeared in all the digitized publications for a given year. By plotting the year-by-year count, the software produced an N-gram: a chart showing the frequency of word use over time. For example, the chart below shows the frequency of the words “data,” “fact,” and “information” plotted from 1800 to 2000.

This is a powerful tool, and there is a lot we can discern from even this basic query (which took all of five seconds): “fact” became more and more important through the 19th century, reaching a plateau by about 1920 and then declining sharply from about 1970 onward; “data” and “information” grew rapidly in popularity in the 20th century, tracking each other closely until about the mid-1980s.
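
To make the counting idea concrete, here is a minimal sketch in Python. It is not the actual N-gram viewer code: the corpus layout (a list of year and text pairs), the function names, and the simple tokenizer are all assumptions made for illustration. An n-gram here means a run of n consecutive words, so a single word is a 1-gram and a two-word phrase is a 2-gram.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed, crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def count_by_year(corpus, phrase):
    """Return {year: occurrences of phrase}, summed over all texts for that year.

    corpus is an iterable of (year, text) pairs; this layout is hypothetical.
    """
    target = tokenize(phrase)
    if not target:
        raise ValueError("phrase must contain at least one word")
    wanted = " ".join(target)
    counts = defaultdict(int)
    for year, text in corpus:
        tokens = tokenize(text)
        counts[year] += sum(1 for gram in ngrams(tokens, len(target)) if gram == wanted)
    return dict(sorted(counts.items()))


# Toy data invented for this example; it stands in for the digitized books.
toy_corpus = [
    (1900, "The fact is that a fact needs no data."),
    (1900, "One fact after another."),
    (1990, "Big data and more data, but hardly a fact."),
]

print(count_by_year(toy_corpus, "fact"))  # {1900: 3, 1990: 1}
```

Plotting the resulting year-to-count mapping for each query word would give a chart of the kind described above. Note that the sketch returns raw counts only; it records nothing about who wrote the text or the context in which a word appears.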

Aiden and Michel have written about their N-gram viewer and how they came to develop it in a book: Uncharted: Big Data as a Lens on Human Culture (2013). The book gives many interesting examples of how N-grams might be used to track fame, understand nationalism, explore the birth and death of words and concepts, and follow the influence of inventions. Aiden and Michel see their tool as part of a big data revolution that will transform the way we understand human culture in general and history in particular. “Big data,” they argue, “is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower” (p. 8). They see the N-gram as an instrument, like the microscope, that will provide us with new quantitative insight into historical change. They call this approach “culturomics” (by analogy with genomics).

But there are some good reasons to be skeptical, or perhaps even a little worried, about such claims. There are two issues in particular that are suggestive of more general problems for big data approaches.

First, there is the issue of completeness. One of the vaunted features of big data approaches is their ability to utilize very wide data sets. Viktor Mayer-Schönberger and Kenneth Cukier have argued that one of the unique features of big data approaches is the use of “all available data.” Of course, Google Books has a very wide (and long) data set – 30 million or so books. For this reason it is easy to see something like the N-gram viewer as working with “all available data” or some close approximation of it. But even this huge amount of text is only 20-25% of all material ever published. This is a large fraction, and it is no doubt valuable for all sorts of purposes. But it raises questions about what is left out. Do we have a representative sample? Are some sorts of books being systematically excluded? How does copyright influence which books are available?

Even leaving aside the question of what is included and excluded, we have another problem. Google Books only (at the moment at least) takes account of published materials. Humans have, over the last centuries, created and left behind huge troves of material that was never published – letters, diaries, manuscripts, memos, files, etc. Paying attention only to published sources can lead to unbalanced accounts. Publication is expensive, and those who publish written work tend to be those in positions of power and influence (men not women, lords not peasants, owners not workers, colonizers not colonized, etc.). Historians have grappled with this problem for many years – writing balanced history often means finding a way to account for inherently unbalanced sources. In dealing with societies where only a small fraction of the population is literate, one must be even more careful not to take any written material (published or unpublished) as representative of the overall population.

Second, there is the issue of quantitation. Historians attempt to make up for the inadequacy of their sources by using their intuition, judgment, empathy, and imagination. This sort of work may become increasingly marginalized in a world where the humanities and social sciences are subjected to quantitation. Aiden and Michel imagine and celebrate a “scrambling of the boundaries between science and the humanities” (p. 207). Surely, both have a lot to learn from one another. But it is important that quantitation does not dominate history, or the other humanities, too much. Counting words is important, but it cannot tell us everything. The N-gram viewer, for example, tells us nothing about the context in which words appear. Context is exactly the kind of thing that historians are concerned with. Ultimately, the N-gram viewer is a one-dimensional measure of a highly multidimensional space (culture, society, history).

To be fair, Aiden and Michel are aware of many of the shortcomings of their tool, and their book contains some intelligent discussions of its possible limitations. However, big data tools of this type – powerful and yet easy to use – are proving immensely popular. They are seductive. Here, and with big data in general, we should be careful not to confuse having very large amounts of data with having “all data,” and we should temper the drive to quantification with careful judgment and intuition.

hstevens October 4, 2014