SmartData Collective
© 2008-23 SmartData Collective. All Rights Reserved.
Following the Data

hstevens
Last updated: 2014/06/22 at 3:53 AM

In this post I want to explore a method for helping us to better understand big data and what it is telling us. I call this method ‘following the data’. This was an approach that I utilized in my book, where I was trying to understand the effect of data on biology. First, I’m going to give some details about how to apply this method, then give some examples of its use, suggesting what kinds of things we might learn from ‘following the data’ and why these might be important.

The 1976 docu-drama All the President’s Men about the Watergate scandal popularized the phrase ‘follow the money’ – tracing the trail of cash would reveal the roots of the corruption. The idea with ‘following the data’ is basically the same. We see the results of big data as flashy charts, visualizations, infographics, or numbers. These are often seductive images or statistics. But how does the data get into this form? The aim of following the data is to trace the path that data takes before arriving at its final destination. Usually, it travels through various pieces of software and hardware – it is poured into databases or spreadsheets, crunched through algorithms, merged with other data, and so on. Through these processes it takes on a variety of shapes and sizes.

Ultimately, following the data also means following it back to its source: understanding where it came from and how it got to be where it ended up. Data is repurposed from all kinds of places and put to all kinds of different uses. Understanding the meanings of big data requires knowing its origins and how the data was shaped and manipulated along the way. Following the data just means paying attention to these sorts of trails (as difficult as they may be to unravel).
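
One way to make such a trail easier to follow is to have the data carry a record of its own journey. The sketch below is only an illustration of that idea, not a description of any particular tool: the `TracedData` class, its methods, and the file name `hypothetical_survey.csv` are all invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TracedData:
    """A list of values bundled with a note of where it came from
    and what has been done to it (hypothetical helper class)."""
    rows: list
    source: str
    trail: List[str] = field(default_factory=list)

    def apply(self, description: str, step: Callable[[list], list]) -> "TracedData":
        # Return a new object so that earlier stages keep their own, shorter trail.
        return TracedData(step(self.rows), self.source, self.trail + [description])

    def lineage(self) -> str:
        # Human-readable path from the source through every recorded step.
        return " -> ".join([f"source: {self.source}"] + self.trail)


# The source name is made up for illustration.
raw = TracedData(rows=[3, 1, None, 2], source="hypothetical_survey.csv")
cleaned = raw.apply("drop missing values", lambda r: [x for x in r if x is not None])
ranked = cleaned.apply("sort ascending", sorted)

print(ranked.rows)
print(ranked.lineage())
```

The point of the design is that the final, polished result can still be asked how it was made; each transformation leaves a mark rather than silently replacing what came before.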

Let’s see where following the data can take us.

First, an example from biology. One kind of image that has become popular recently is CIRCOS (created by Martin Krzywinski). CIRCOS images can be (and have been) used to depict different aspects of genomes. The idea is basically that the chromosomes are arranged in a ring around the edge of the image and colored lines are used to show various kinds of links or connections between different locations on different chromosomes. These links could connect anything on the genome: the locations of cancer genes, methylation sites, and so on. Here’s an example:

[Image: CIRCOS genomic data]

This CIRCOS image shows connections between areas implicated in various diseases. The visualization can help us to see at a glance where the important genomic locations for disease lie. Such an image can be very useful for biologists. But where did this data come from and how was it made into this pretty circle? The ‘references’ section on the page tells us that most of the data comes from the UCSC genome browser, as well as from other databases such as TCAG, OMIM, Cancer Gene Census, and KEGG. These databases all have their own structures and methods for gathering and curating data. Here I want to follow just the data from the UCSC genome browser. This is where the data about the structure and sequence of the human genome in this image comes from. More specifically, it comes from ‘hg18’, an assembly of the genome completed in March 2006. Such ‘assemblies’ are actually the product of sophisticated algorithms (such as the one used by UCSC called GigAssembler) that take millions of pieces of sequenced DNA and put them together into a continuous sequence (something like a huge jigsaw puzzle). These individual pieces of sequence are generated by DNA sequencing machines in a handful of labs spread out across the world. Following the data to its source, we could even ask: which individuals did this DNA come from in the first place? Due to confidentiality, we will probably never be able to answer this fully, although we do know that over 70% of the DNA for the reference sequence of the human genome came from one male in Buffalo, NY.
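
To give a feel for the jigsaw-puzzle idea, here is a toy sketch that repeatedly joins the two fragments with the longest overlap. It is emphatically not GigAssembler or any real assembler, which must cope with sequencing errors, repeats, and millions of reads; the function names and the three short reads are invented for illustration.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`
    (0 if the overlap is shorter than `min_len`)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Toy greedy assembly: merge the best-overlapping pair until none overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to join
        # Glue the pair together, keeping the shared stretch only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads, deliberately given out of order.
reads = ["ACCTGA", "ATGCGT", "CGTACC"]
print(greedy_assemble(reads))
```

Even in this toy version, the ‘continuous sequence’ is a product of choices (the minimum overlap, the order in which pairs are merged) rather than something read directly off the molecule.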

The continuous circle in the image is a kind of illusion. The molecules inside our cells that comprise our genomes do not look anything like this. We can only see the genome in this way because of all the work done by sequencing machines and labs and algorithms to compress it into this shape. Following the data reminds us of the limits of visualization – it is just one way of seeing things, and it suppresses other details and shapes our views in very particular ways. The more closely we follow the data, the more we see what is left out of a CIRCOS view and how artificial that view is. This is not a criticism of CIRCOS: as I said, such views are extremely valuable for biology. But we need to keep open the possibility of alternative views too. Following the data exposes the constructedness of particular views and allows us to be critical of what big data representations purport to tell us.

Second, an example from contemporary culture. In May, the designer and data scientist Matt Daniels came up with this representation of the vocabularies of hip-hop artists. Daniels wanted not only to compare the artists to one another but also to compare their linguistic outputs to Shakespeare and Herman Melville. This is certainly pretty fun, and it circulated widely on social media. Daniels compared the number of unique words in the first 35,000 words of each rapper’s lyrical oeuvre (comparing it to Shakespeare’s first seven plays and the first 35,000 words of Moby Dick).
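
The measure described above can be sketched in a few lines. This is a guess at the general shape of such a count, not Daniels’ actual code: the function name is invented, and the tokenization rule (lowercased runs of letters and apostrophes) is an assumption that already shapes the result.

```python
import re


def unique_words_in_prefix(text: str, limit: int = 35_000) -> int:
    """Count distinct words among the first `limit` words of `text`."""
    # Assumption: a "word" is a lowercased run of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens[:limit]))


sample = "the cat saw the dog and the dog saw the cat"
print(unique_words_in_prefix(sample, limit=5))  # only the first five words
print(unique_words_in_prefix(sample))           # the whole (short) text
```

Small decisions here, such as whether ‘Don’t’ and ‘don’t’ count as one word or how slang spellings are treated, would change the totals, which is exactly the kind of detail that following the data brings into view.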

What happens if we follow the data? The aim of Daniels’ infographic is to make us think about the richness of rap as a form of poetic expression (comparable to Shakespeare perhaps). But the form that the data takes is one-dimensional: it is just counting words. This is probably not a very adequate way of measuring linguistic sophistication and certainly not a good way of performing poetic analysis. Of course, Daniels is just trying to give us a rough comparison here, not a detailed exegesis. But nevertheless, following the data by paying attention to its dimensionality and shape can give us an indication of where some of the shortcomings of an analysis might lie.

Where did Daniels obtain the lyrics to do such an analysis? From a site called Rap Genius. Here is how they describe themselves:

Rap Genius is dedicated to the crowdsourced annotation of music, news, literature, history, and just about any other text you could imagine. We believe in collaborative close reading—that every text is made more understandable, and interesting, by our shared attention. Join us, and help build the world’s greatest public knowledge project.

In other words, these are not ‘official’ lyrics, but rather lyrics uploaded and edited by fans, experts, and sometimes the artists themselves. Lyrics are annotated with a huge amount of detail regarding their meaning, context, cross-references, and so on. We can trace this data back to a variety of people and places: individuals’ own listening and transcription, CD case or LP jacket inserts, and the artists’ own accounts of their verse.

Here, following the data tells us a story about the cultural significance, relevance, and social meanings of such rap lyrics – the fact that such detailed records exist for Daniels to use makes this analysis possible in the first place. Indeed, the data (and its annotation) represent thousands (if not tens of thousands) of hours of work by a large number of individuals (it is a tremendously rich source that even comes with its own rap-only N-gram viewer). Whether or not rap compares to Shakespeare’s verse in terms of vocabulary, rhyme, meter, etc. is perhaps less significant than the fact that it captures the social and cultural attention of so many people who contributed to building such a rich data set. Daniels’ analysis would not be possible (or would at least be much, much more difficult) without the existence of thousands of individuals who believe that rap is culturally important. The existence of the data is perhaps more telling than the analysis itself.

Following the data back to its source(s) tells us how data sets come into existence and the purposes for which they are assembled – this sort of information is critical for evaluating the meaning and significance of big data.

Image credit: Krzywinski, M. et al. ‘Circos: an information aesthetic for comparative genomics’ Genome Research 19 (2009): 1639-1645.
