Following the Data

In this post I want to explore a method for helping us better understand big data and what it is telling us. I call this method ‘following the data’. It is an approach I used in my book, where I was trying to understand the effect of data on biology. First, I’m going to give some details about how to apply this method, then give some examples of its use, suggesting what kinds of things we might learn from ‘following the data’ and why these might be important.

The 1976 docu-drama All the President’s Men about the Watergate scandal popularized the phrase ‘follow the money’ – tracing the trail of cash would reveal the roots of the corruption. The idea with ‘following the data’ is basically the same. We see the results of big data as flashy charts, visualizations, infographics, or numbers. These are often seductive images or statistics. But how does the data get into this form? The aim of following the data is to trace the path that data takes before arriving at its final destination. Usually, it travels through various pieces of software and hardware – it is poured into databases or spreadsheets, crunched through algorithms, merged with other data, and so on. Through these processes it takes on a variety of shapes and sizes.

Ultimately, following the data also means following it back to its source: understanding where it came from and how it got to be where it ended up. Data is repurposed from all kinds of places and put to all kinds of different uses. Understanding the meanings of big data requires knowing its origins and how the data was shaped and manipulated along the way. Following the data just means paying attention to these sorts of trails (as difficult as they may be to unravel).

Let’s see where following the data can take us.

First, an example from biology. One kind of image that has become popular recently is CIRCOS (created by Martin Krzywinski). CIRCOS images can be (and have been) used to depict different aspects of genomes. The idea is basically that the chromosomes are arranged in a ring around the edge of the image and colored lines are used to show various kinds of links or connections between different locations on different chromosomes. These links could connect anything on the genome: the locations of cancer genes, methylation sites, and so on. Here’s an example: [image: CIRCOS genomic data]

This CIRCOS image shows connections between areas implicated in various diseases. The visualization can help us to see at a glance where the important genomic locations for disease lie. Such an image can be very useful for biologists. But where did this data come from and how was it made into this pretty circle? The ‘references’ section on the page tells us that most of the data comes from the UCSC genome browser, as well as from other databases such as TCAG, OMIM, Cancer Gene Census, and KEGG. These databases all have their own structures and methods for gathering and curating data. Here I want to follow just the data from the UCSC genome browser. This is where the data about the structure and sequence of the human genome in this image comes from. More specifically, it comes from ‘hg18’, an assembly of the genome completed in March 2006. Such ‘assemblies’ are actually the product of sophisticated algorithms (such as the one used by UCSC called GigAssembler) that take millions of pieces of sequenced DNA and put them together into a continuous sequence (something like a huge jigsaw puzzle). These individual pieces of sequence are generated by DNA sequencing machines in a handful of labs spread out across the world. Following the data to its source, we could even ask: which individuals did this DNA come from in the first place? Due to confidentiality, we will probably never be able to answer this fully, although we do know that over 70% of the DNA for the reference sequence of the human genome came from one male in Buffalo, NY.

The continuous circle in the image is a kind of illusion. The molecules inside our cells that comprise our genomes do not look anything like this. We can only see the genome in this way because of all the work done by sequencing machines and labs and algorithms to compress it into this shape. Following the data reminds us of the limits of visualization – it is just one way of seeing things, and it suppresses other details and shapes our views in very particular ways. The more closely we follow the data, the more we see what is left out of a CIRCOS view and how artificial that view is. This is not a criticism of CIRCOS: as I said, such views are extremely valuable for biology. But we need to keep open the possibility of alternative views too. Following the data exposes the constructedness of particular views and allows us to be critical of what big data representations purport to tell us.

Second, an example from contemporary culture. In May, the designer and data scientist Matt Daniels came up with this representation of the vocabularies of hip-hop artists. Daniels wanted not only to compare the artists to one another but also to compare their linguistic outputs to those of Shakespeare and Herman Melville. This is certainly pretty fun, and it circulated widely on social media. Daniels counted the number of unique words in the first 35,000 words of each rapper’s lyrical oeuvre (comparing it to the first 35,000 words of Shakespeare’s first seven plays and the first 35,000 words of Moby Dick).
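To make the shape of this measurement concrete, here is a minimal sketch of the kind of count described above. It is not Daniels’ actual code: the function name and the tokenization rule (lowercasing, and treating any run of letters and apostrophes as a word) are my own assumptions, and a different rule would give different numbers.

```python
import re


def unique_words_in_prefix(text, limit=35_000):
    """Count distinct words among the first `limit` words of `text`.

    Assumption (not from the original analysis): a "word" is a run of
    ASCII letters or apostrophes, compared case-insensitively.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens[:limit]))


# Tiny illustration with a made-up line and a small limit.
sample = "The cat and the hat"
print(unique_words_in_prefix(sample, limit=4))  # counts only "the cat and the"
print(unique_words_in_prefix(sample))           # counts the whole line
```

Even in this toy form, the choices are visible: where a word begins and ends, whether ‘The’ and ‘the’ are the same word, and where the 35,000-word cutoff falls all shape the final number.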

What happens if we follow the data? The aim of Daniels’ infographic is to make us think about the richness of rap as a form of poetic expression (comparable to Shakespeare perhaps). But the form that the data takes is one-dimensional: it is just counting words. This is probably not an adequate way of measuring linguistic sophistication and certainly not a good way of performing poetic analysis. Of course, Daniels is just trying to give us a rough comparison here, not a detailed exegesis. But nevertheless, following the data by paying attention to its dimensionality and shape can give us an indication of where some of the shortcomings of an analysis might lie.

Where did Daniels obtain the lyrics to do such an analysis? From a site called Rap Genius. Here is how they describe themselves:

Rap Genius is dedicated to the crowdsourced annotation of music, news, literature, history, and just about any other text you could imagine. We believe in collaborative close reading—that every text is made more understandable, and interesting, by our shared attention. Join us, and help build the world’s greatest public knowledge project.

In other words, these are not ‘official’ lyrics, but rather lyrics uploaded and edited by fans, experts, and sometimes the artists themselves. Lyrics are annotated with a huge amount of detail regarding their meaning, context, cross-references, and so on. We can trace this data back to a variety of people and places: individuals’ own listening and transcription, CD case or LP jacket inserts, and the artists’ own accounts of their verse.

Here, following the data tells us a story about the cultural significance, relevance, and social meanings of rap lyrics – the fact that such detailed records exist for Daniels to use makes this analysis possible in the first place. Indeed, the data (and its annotation) represent thousands (if not tens of thousands) of hours of work by a large number of individuals (it is a tremendously rich source that even comes with its own rap-only N-gram viewer). Whether or not rap compares to Shakespeare’s verse in terms of vocabulary, rhyme, meter, etc. is perhaps less significant than the fact that it captures the social and cultural attention of so many people who contributed to building such a rich data set. Daniels’ analysis would not be possible (or would at least be much, much more difficult) without the existence of thousands of individuals who believe that rap is culturally important. The existence of the data is perhaps more telling than the analysis itself.

Following the data back to its source(s) tells us how data sets come into existence and the purposes for which they are assembled – this sort of information is critical for evaluating the meaning and significance of big data.

Image credit: Krzywinski, M. et al. ‘Circos: an information aesthetic for comparative genomics’ Genome Research 19 (2009): 1639-1645.