By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData Collective
  • Analytics
    AnalyticsShow More
    data analytics in sports industry
    Here’s How Data Analytics In Sports Is Changing The Game
    6 Min Read
    data analytics on nursing career
    Advances in Data Analytics Are Rapidly Transforming Nursing
    8 Min Read
    data analytics reveals the benefits of MBA
    Data Analytics Technology Proves Benefits of an MBA
    9 Min Read
    data-driven image seo
    Data Analytics Helps Marketers Substantially Boost Image SEO
    8 Min Read
    construction analytics
    5 Benefits of Analytics to Manage Commercial Construction
    5 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-23 SmartData Collective. All Rights Reserved.
Reading: The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”
Share
Notification Show More
Latest News
data analytics in sports industry
Here’s How Data Analytics In Sports Is Changing The Game
Big Data
data analytics on nursing career
Advances in Data Analytics Are Rapidly Transforming Nursing
Analytics
data analytics reveals the benefits of MBA
Data Analytics Technology Proves Benefits of an MBA
Analytics
anti-spoofing tips
Anti-Spoofing is Crucial for Data-Driven Businesses
Security
ai in software development
3 AI-Based Strategies to Develop Software in Uncertain Times
Software
Aa
SmartData Collective
Aa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”
Big Data

The N-gram and the Book “Uncharted: Big Data as a Lens on Human Culture”

hstevens
Last updated: 2014/10/04 at 10:57 AM
hstevens
7 Min Read
SHARE
- Advertisement -

Between 2007 and 2010, graduate students Erez Aiden, Jean-Baptiste Michel and Yuan Shen developed a tool for analyzing Google’s digitized books. Known as the N-gram viewer, their tool essentially just counted the number of times a word or phrase appeared in all the digitized publications for a given year.

Between 2007 and 2010, graduate students Erez Aiden, Jean-Baptiste Michel and Yuan Shen developed a tool for analyzing Google’s digitized books. Known as the N-gram viewer, their tool essentially just counted the number of times a word or phrase appeared in all the digitized publications for a given year. By plotting the year-by-year count, the software produced an N-gram: a chart showing the frequency of word use over time. For example, the chart below shows the frequency of the words “data,” “fact,” and “information” plotted from 1800 to 2000.

This is a powerful tool and there is lots that we can discern from even this basic query (that took all of 5 seconds): “facts” became more and more important through the 19th century, reaching a plateau by about 1920, and then declining sharply from about 1970 onward; “data” and “information” have grown rapidly in popularity in the 20th century, tracking each other closely until about the mid-1980s.

More Read

data analytics in sports industry

Here’s How Data Analytics In Sports Is Changing The Game

Advances in Data Analytics Are Rapidly Transforming Nursing
Data Analytics Helps Marketers Substantially Boost Image SEO
Data Visualization Boosts Business Scalability with Sales Mapping
Data-Driven Marketing Offers Huge Benefits for Landscapers

Aiden and Michel have written about their N-gram viewer and how they came to develop it in a book: Uncharted: Big Data as a Lens on Human Culture (2013). The book gives lots of interesting example of how N-grams might be used to track fame, understand nationalism, explore the birth and death of words and concepts, and follow the influence of inventions. Aiden and Michel see their tool as part of a big data revolution that will transform the way we understand human culture in general and history in particular. “Big data,” they argue, “is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower” (p. 8). They see the N-gram as an instrument, like the microscope, that will provide us with a new quantitative insight into historical change. They call this approach “culturomics” (in analogy with genomics).  

But there are some good reasons to be skeptical, or perhaps even a little worried, about such claims. There are two issues in particular that are suggestive of more general problems for big data approaches.

First, there is the issue of completeness. One of the vaunted features of big data approaches are their ability to utilize very wide data sets. Victor Mayer-Schonberger and Kenneth Cukier have argued that one of the unique features of big data approaches is the use of “all available data.” Of course, Google Books has a very wide (and long) data set – 30 million or so books. And for this reason it is easy to see something like N-gram as working with “all available data” or some close approximation of it. But even this huge amount of text is only 20-25% of all material ever published. This is a large fraction, and it is no doubt valuable for all sorts of purposes. But it raises questions about what is left out. Do we have a representative sample? Are some sorts of books being systematically excluded? How does copyright influence which books are available?

Even leaving aside the question of what included and excluded, we have another problem. Google Books only (at the moment at least) takes account of published materials. Humans have, over the last centuries, created and left behind huge troves of unpublished material – letters, diaries, manuscripts, memos, files, etc. that were never published. Paying attention only to published sources can lead to unbalanced accounts. Publication is expensive and often those who publish written work tend to be those in positions of power and influence (men not women, lords not peasants, owners not workers, colonizers not colonized, etc.). Historians have grappled with this problem for many years – writing balanced history often means finding a way to account for inherently unbalanced sources. In dealing with societies where only a small fraction of the population is literate, one must be even more careful not to take any written material (published or unpublished) as representative of the overall population.

Second, there is the issue of quantitation. Historians attempt to make up for the inadequacy of their sources by using their intuition, judgment, empathy, and imagination. This sort of work may become increasingly marginalized in a world where the humanities and social sciences are subjected to quantitation. Aiden and Michel imagine and celebrate a “scrambling of the boundaries between science and the humanities” (p. 207). Surely, both have a lot to learn from one another. But it is important that quantitation does not dominate history, or other humanities too much. Counting words is important, but it cannot tell us everything. N-gram viewer tells us nothing about the context in which words appear for example. Context is exactly the kind of thing that historians are concerned with. Ultimately, N-gram viewer is a one-dimensional measure of a highly-multi-dimensional space (culture, society, history). 

To be fair, Aiden and Michel are aware of many of the shortcomings of their tool and their book contains some intelligent discussions of its possible limitations. However, big data tools of this type – powerful and yet easy to use – are proving immensely popular. They are seductive. However, here and with big data in general, we should be careful not to confuse having very large amounts of data with having “all data” and we should be careful to temper the drive to quantification with careful judgment and intuition.

hstevens October 4, 2014
Share this Article
Facebook Twitter Pinterest LinkedIn
Share
- Advertisement -

Follow us on Facebook

Latest News

data analytics in sports industry
Here’s How Data Analytics In Sports Is Changing The Game
Big Data
data analytics on nursing career
Advances in Data Analytics Are Rapidly Transforming Nursing
Analytics
data analytics reveals the benefits of MBA
Data Analytics Technology Proves Benefits of an MBA
Analytics
anti-spoofing tips
Anti-Spoofing is Crucial for Data-Driven Businesses
Security

Stay Connected

1.2k Followers Like
33.7k Followers Follow
222 Followers Pin

You Might also Like

data analytics in sports industry
Big Data

Here’s How Data Analytics In Sports Is Changing The Game

6 Min Read
data analytics on nursing career
Analytics

Advances in Data Analytics Are Rapidly Transforming Nursing

8 Min Read
data-driven image seo
Analytics

Data Analytics Helps Marketers Substantially Boost Image SEO

8 Min Read
data visualization for small business
Data Visualization

Data Visualization Boosts Business Scalability with Sales Mapping

7 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

ai is improving the safety of cars
From Bolts to Bots: How AI Is Fortifying the Automotive Industry
Artificial Intelligence
data-driven web design
5 Great Tips for Using Data Analytics for Website UX
Big Data

Quick Link

  • About
  • Contact
  • Privacy
Follow US

© 2008-23 SmartData Collective. All Rights Reserved.

Removed from reading list

Undo
Go to mobile version
Welcome Back!

Sign in to your account

Lost your password?