Gigascience: A Science Journal That Provides Full Data Sets

A few weeks ago I visited the Hong Kong headquarters of the biomedical journal Gigascience. The journal was established in 2012 with funding from the Chinese company BGI (formerly the Beijing Genomics Institute). What makes Gigascience unique is that it makes available, through its dedicated servers and database, the full data sets associated with the articles that it publishes. In addition, the journal develops, maintains, and hosts software and cloud-computing resources for analyzing data. Its goals are “to revolutionize data dissemination, organization, understanding, and use.”

In launching the journal, its editors wrote that one of the main motivations was a growing “reproducibility gap”: the unavailability of data on which scientific conclusions were based meant that it was increasingly difficult, if not impossible, to reproduce the results of experiments. In the worst cases, this enabled outright fraud (or at least made such fraud very difficult to detect). By working towards producing scientific papers that are “executable,” Gigascience hopes to make verification of data, and replication of experiments based on that data, simple, or even automatic.

In the first place, then, Gigascience may represent a new set of standards or norms for communicating scientific information. The scientific journal article has been around since the middle of the seventeenth century. As Alex Csiszar has argued, however, journals really only came into their own in the late eighteenth and nineteenth centuries, when science and scientists needed to find new ways of legitimating their work and bolstering their authority amidst rapidly changing social and political circumstances. In particular, the popular scientific press was posing a significant challenge to scientific legitimacy.

If we accept that Gigascience does represent a new form of scientific communication, then it is interesting to speculate about how it might connect to present-day society and politics. For one thing, it may be indicative of an increasing public skepticism of science – as its editors suggest, Gigascience is designed partly to instill greater confidence in scientific results and avoid scandals arising from falsified data or shoddy analysis. In other words, Gigascience attempts to generate greater openness and transparency in the scientific process.

But Gigascience also points to the wider authority that inheres in data itself. Obviously, the journal’s emergence tells us much about the growing epistemic importance of data in science – data is playing a more and more important role relative to more formalized knowledge. But it also points to the growing visibility and authority of data in other realms: the economic, the political, and the social. As society places more trust in data, science is also forced to adapt its norms for legitimation and authorization.

But Gigascience can also tell us what it is about data that is generating its special claims to authority. After all, it is not just that Gigascience (and the papers that it publishes) uses data – pretty much all scientific journals do that. Instead, Gigascience specializes in making data widely available, in sharing it, and in communicating it to as many people as possible. The staff at Gigascience whom I met in Hong Kong told me about how Gigascience had facilitated a collaboration between BGI and the International Rice Research Institute. The 3000 Rice Genomes Project placed a massive amount of rice genomic data in the public domain. The aim was to put this data in the hands of small-scale and local researchers who often had more direct contact with farmers but would not have the resources (sequencing machines, servers) to do such sequencing themselves.

Gigascience’s data is useful because it is shared and open. More specifically, it is valuable because it is made available for use and re-use in multiple places and in multiple contexts. The rice genomes might now be used for formulating new rice strains in Africa or for comparison with tulip genomes in Holland. The data is now opened up to all kinds of possibilities and many possible meanings and valences. This suggests something important about (big) data more generally: data is not just useful or interesting for its own sake. Its value comes through its wide circulation and the possibility of its re-use in new, different, and unexpected ways and contexts. Gigascience, and other scientific outlets that are taking on similar ideas, are important because they are making this possible within science. But, as I have suggested above, modes of authority in science are connected to modes of authority in wider society. As such, it is likely that the value of data more generally is connected to its possibilities for wide circulation and re-use.