Data Infrastructures

It is stating the obvious to say that big data relies on computers. It would be practically impossible to implement big data algorithms with pencil and paper – it would just take too much paper and too long to write all the data down. But the scale of big data is not the only thing that ties it to computers – big data is dependent on computer hardware and software in all kinds of ways. Data has to be manipulated into certain forms to get inside computers, to be transmitted over networks, and so on.

These shapings of data are often hidden from the direct view of data users. But they are important – the shapes that data can take and the ways in which it can be manipulated determine the kinds of things that data can show and tell us. I call these shapes ‘data infrastructures’ – that is, all the structures and forms that data must take inside the computer in order to be manipulated as big data. To give a sense of how significant such infrastructures are and what kinds of influence they have on our thinking, I will explore one ubiquitous example here in some detail.

In the twenty-first century, perhaps the most significant data infrastructure of all is the World Wide Web (WWW). What is the structure of the WWW? Well, it’s a web, of course! The WWW’s most important feature, and arguably the reason for its great success, is the hyperlink: text from any one WWW source can be “marked up” so that it forms a “link” to any other WWW source. This means that data can be cross-linked, suggesting ways of reading and writing that are multiple and non-linear.

What was the context in which this system was designed and the purposes it was intended to serve? In the 1980s, Tim Berners-Lee, the WWW’s designer, was a computer programmer working at the Conseil Européen pour la Recherche Nucléaire (CERN – a massive high energy physics lab straddling the border between France and Switzerland). Berners-Lee saw a failure of information management: the many computers at CERN stored information in different ways and in different formats. Although the computers were networked, there was little practical way to find out anything about what was stored on other machines. Berners-Lee saw work being duplicated and effort wasted due to the inability to share data effectively. The WWW was his solution: it was intended to radically expand the circulation of all kinds of information within the closed community of CERN.

In the late 1980s and early 1990s, the WWW was not the only solution to the problem of managing information in an online network. As various networks around the world were joined together into the Internet, different ideas emerged as to how to organize all this newly accessible information. For instance, in 1991, a team at the University of Minnesota released Gopher – a protocol for retrieving documents over the Internet. Unlike the WWW, Gopher consisted of a series of menus: if you wanted to find a page about, for example, mosquitoes, you might navigate to a menu of animals, then to a menu of insects, and then to the page you wanted. In the early 1990s, Gopher was a real alternative to the WWW – it imposed more hierarchy and organization on data and was therefore faster and more intuitive for finding many kinds of information.
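The contrast between Gopher’s menus and the WWW’s links can be made concrete with a small sketch. This is an illustration only, not how either system was actually implemented: the menu entries, page names, and the functions follow_menus and reachable are all hypothetical, chosen to echo the mosquito example. The first structure is a tree that must be descended one menu at a time; the second is a graph in which any page may point to any other.

```python
# Hypothetical, illustrative data only: not real Gopher menus or web pages.

# Gopher-style hierarchy: each menu maps a choice to a sub-menu (dict)
# or, at the bottom, to a document (str).
gopher_menu = {
    "animals": {
        "insects": {
            "mosquitoes": "A page about mosquitoes.",
        },
        "mammals": {
            "bats": "A page about bats.",
        },
    },
}


def follow_menus(menu, path):
    """Walk down the menu tree, making one choice per level."""
    node = menu
    for choice in path:
        node = node[choice]
    return node


# Web-style structure: each page lists the pages it links to.
# Links can cross categories and can form cycles.
web_links = {
    "mosquitoes": ["malaria", "bats"],
    "bats": ["mosquitoes", "caves"],
    "malaria": ["mosquitoes"],
    "caves": [],
}


def reachable(links, start):
    """Return every page that can be reached from start by following links."""
    seen = set()
    stack = [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue  # already visited; avoids looping forever on cycles
        seen.add(page)
        stack.extend(links.get(page, []))
    return seen


if __name__ == "__main__":
    print(follow_menus(gopher_menu, ["animals", "insects", "mosquitoes"]))
    print(sorted(reachable(web_links, "malaria")))
```

In the tree there is exactly one route to the mosquito page, and it passes through every menu above it. In the graph there is no single route and no top level: a reader can arrive at a page from many directions, which is the flexibility, and the lack of built-in organization, that the hyperlink brings.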

I describe Gopher to show that, however much we now take the WWW for granted, it is in fact one amongst several possible alternatives. It is one particular way of structuring information and the relationships between different pieces of information. It was designed for a particular purpose and that purpose rendered the structure of the WWW particularly decentralized, freeform, and non-hierarchical. This has some advantages, such as accommodating many different kinds of information. But it also has some disadvantages, such as a lack of organization or indexing of information (a problem we have had to solve using search engines).

In any case, the structure of the WWW is not neutral. It makes doing some tasks easier and others harder; it makes some paths or connections simple to follow, others not so simple. Structures like the WWW are so ubiquitous that they become invisible. This, however, does not diminish their importance.

When it comes to big data, we find data infrastructures everywhere: from the structure of hard drives to the organization of algorithms and databases. These physical and virtual structures place constraints on how data can be organized, processed, and accessed. Understanding the advantages and disadvantages of big data (in various forms) means understanding these structures – in particular it means knowing where they came from and what they were designed to do. Ultimately, what we get out of big data will be constrained by the structures we put it into.