The Big Data Capacity Crisis

October 21, 2014

Big data is BIG – I’ve mentioned that before I think! If all of the data in the world was printed out – there would be no one to count how many pages it filled as everyone would have suffocated under a mountain of paper.

Computers offer far more efficient storage capacity than paper – information stored magnetically on disks can be compressed to a much greater degree. But even so, are we reaching a capacity crisis?

With the amount of data being generated worldwide increasing at a rate of 40%-60% per year (depending on who you believe), some people certainly think so. The EMC Corporation’s Digital Universe 2014 report states “data is growing so fast there are bound to be difficulties ahead.”

According to the report, 2007 was the first year in which the amount of new data being generated (and captured) exceeded the amount of digital storage space being manufactured – and by 2011 it was more than double.

The reason we have not yet run out is that much of the data generated is transient: it is stored only temporarily, for as long as it is immediately useful, and then overwritten – for example, content streamed from YouTube or Netflix, which is cached locally only until you have finished watching it.

The fact remains, though, that the growth in the worldwide volume of data is increasingly outpacing the manufacture of physical storage – after all, it is far easier to generate digital data than to build devices like hard disks, optical media and solid-state drives. Intel’s Jim Held told a conference on the subject back in 2010: “Walmart adds a billion rows per minute to its database, YouTube contains as much data as all the commercial networks broadcast in a year, and the Large Hadron Collider can generate terabytes of data per second.”

As more businesses cotton on to the benefits of capturing and storing as much data as possible, for the purpose of analytics, it is likely there will be a need to permanently archive more and more of this data. The Internet of Things will add to this considerably.

However, increasing the capacity of today’s storage devices fast enough to keep up with this growth would be a huge task – an impossible one, some experts predict. And that is before taking into account the ecological and environmental consequences: today’s biggest data centers require huge amounts of power to operate and generate a lot of heat. This is why they are often located in places such as Iceland, where low ambient temperatures help with system cooling and geothermal energy provides power.

But there is always the hope that new advances in technology will bring new solutions. Despite the emergence of more advanced technologies such as optical and solid-state storage, the mechanical, magnetic hard disks in use since the 1970s still hold the vast majority of the world’s data – primarily because they are cheap. That may have to change if demand for storage space continues to outstrip supply.

Cloud services may provide some of the answers. By optimizing their filing systems, companies like Google and Dropbox are working to reduce the redundancy created when identical data is held across multiple accounts – for example, when one person emails a 5MB PDF to 100 people, half a gigabyte of duplicated data is instantly created. Technologies are being developed to make efficiencies here.
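The usual mechanism behind this is content-based deduplication: each unique file is physically stored once, keyed by a hash of its contents, and every account merely holds a pointer to that single copy. Here is a minimal sketch in Python – the class and method names are illustrative, not any real provider’s API:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical files are stored once."""

    def __init__(self):
        self.blocks = {}   # content hash -> file bytes (one physical copy)
        self.refs = {}     # (user, filename) -> content hash (a pointer)

    def put(self, user, filename, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has never been seen before
        if digest not in self.blocks:
            self.blocks[digest] = data
        self.refs[(user, filename)] = digest

    def get(self, user, filename) -> bytes:
        return self.blocks[self.refs[(user, filename)]]

    def physical_bytes(self) -> int:
        return sum(len(d) for d in self.blocks.values())


store = DedupStore()
pdf = b"x" * 5_000_000  # stand-in for a 5 MB attachment
for recipient in range(100):
    store.put(f"user{recipient}", "report.pdf", pdf)

# 100 logical copies, but only one 5 MB physical copy is kept
print(store.physical_bytes())  # 5000000, not 500000000
```

Real systems deduplicate at the block level rather than the whole-file level, so even partially identical files share storage, but the principle is the same.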

At some point though, we will inevitably have to move beyond magnetic disk storage. Scientists at Harvard have already discovered a way of storing around 700 terabytes of digital data on one strand of synthesized DNA, weighing just one gram. Looking even further into the future, it is proposed that digital information can be “injected” into bacteria. And taking things to the ultimate extreme, quantum computing suggests that data can be encoded in subatomic systems – experiments have been carried out using the rate of electron spin in nitrogen atoms found within diamonds as a carrier of information.
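The basic idea of DNA storage can be illustrated with a toy encoder: DNA has four bases (A, C, G, T), so each base can carry two bits, and one byte becomes four bases. This sketch is purely illustrative – the Harvard team’s actual scheme is different and more sophisticated, with redundancy and addressing to cope with synthesis and sequencing errors:

```python
# Map every 2 bits of data to one DNA base (a toy scheme, not the
# encoding used in the Harvard experiment).
BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BITS_FOR_BASE = {b: v for v, b in BASE_FOR_BITS.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a strand of bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
    return "".join(bases)

def decode(strand: str) -> bytes:
    """Reverse the mapping: every 4 bases back into one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BITS_FOR_BASE[base]
        out.append(byte)
    return bytes(out)

strand = encode(b"big data")
print(decode(strand))  # b'big data'
```

At two bits per base, the density is enormous: a few hundred atoms per bit, which is how a single gram of DNA can hold hundreds of terabytes.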

This is all science fiction at the moment – the speed at which the information can be read back from DNA is currently far too slow for it to have practical use in mainstream computing, and the information encoded into the electron spin in nitrogen is so unstable that the act of reading it causes it to be erased – but these ideas offer a few suggestions about how we might attempt to solve the looming capacity crisis that many fear we are facing.

 

—–

As always, I hope my posts are useful. For more, please check out my other posts in The Big Data Guru column and feel free to connect with me via Twitter, LinkedIn, Facebook, Slideshare and The Advanced Performance Institute.