Hadoop in the Big Data Stack

What is Hadoop? It’s another wacky name for an open-source software project, but Hadoop was also a significant advancement in the way that companies, governments, and organizations can collect, store, and process data. Companies such as Cloudera and Hortonworks have emerged to deliver professional-level, Hadoop-based data solutions for the enterprise, while many organizations have built successful Hadoop implementations on their own.

Where the traditional model put data in one container and processed it somewhere else, Hadoop and other distributed file systems moved both storage and computing onto a group of connected machines, often simply off-the-shelf hardware. One Big Data Week panelist even dubbed his first Hadoop project, cobbled together from existing equipment, ‘Frankendoop.’ (That same panelist also gave us the term ‘Big Data Landfill’, but more on that later.) By combining storage and analysis on the same machines, Hadoop created a more flexible, if sometimes slower, platform for moving and manipulating data. Spreading the work across the cluster also spreads the load, turning one large job into many smaller ones and finishing certain jobs much faster, as the sketch below illustrates.
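To make the “many smaller jobs” idea concrete, here is a minimal word-count sketch written in the map-and-reduce style that Hadoop popularized. It is an illustration only, not code from any project mentioned in this article; the script name and the local sort pipeline (standing in for Hadoop’s shuffle step) are assumptions.

#!/usr/bin/env python3
"""Word count in the MapReduce style: a map step that emits (word, 1)
pairs and a reduce step that sums them. Run locally, a plain `sort`
can stand in for Hadoop's shuffle:
    cat input.txt | ./wordcount.py map | sort | ./wordcount.py reduce
"""
import sys


def map_step(stream):
    # Emit a tab-separated (word, 1) pair for every word in the input.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")


def reduce_step(stream):
    # Input arrives sorted by word, so each word's pairs are contiguous
    # and can be summed with a single running total.
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (reduce_step if mode == "reduce" else map_step)(sys.stdin)

On a real cluster the same two steps would typically be handed to the framework (for example through Hadoop Streaming), which splits the input across machines, runs the map step in parallel, shuffles the intermediate pairs, and runs the reduce step, so one large counting job becomes many small ones.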

This transition of data storage and processing power to what’s called commodity hardware did two things:

  • made it incredibly easy to expand storage capacity at very little cost, and
  • removed the very real barriers to data access that exist with a traditional enterprise data warehouse.

Now the flexibility in both the volume and the types of data collected begins to match the requirements of real business use cases. Where the data warehouse required careful data management, Hadoop’s approach allows for frequent data dumps, letting organizations store raw data first and decide how to structure and use it later.
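To give a sense of how low the bar for such a data dump is, the short sketch below assumes the standard hdfs dfs command-line client is available; the directory and file names are hypothetical. Raw files are simply copied into the cluster as-is, with no upfront schema or loading process.

import subprocess

raw_dir = "/data/raw/clickstream"   # hypothetical landing directory in HDFS
local_file = "events.log"           # hypothetical raw log file to dump in

# Create the landing directory if needed, then copy the raw file in unchanged.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", raw_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, raw_dir], check=True)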

And that’s where the Big Data Landfill comes from. While the broader Apache Hadoop project includes a number of analysis tools, and vendors like Cloudera offer their own tools promising ease of use and additional functionality over the open-source alternatives, most Hadoop initiatives function primarily as bulk storage. Hadoop is becoming the de facto storage layer for Big Data, putting it at the center of the standard Big Data project.

While few are predicting that Hadoop and other flexible distributed file systems will completely replace traditional data storage, the momentum, community support, and open-source nature of the Hadoop project mean that it will likely continue to grow more entangled in the Big Data stack. With that market growth, new technologies are already emerging to take advantage of the distributed nature of Hadoop and the possibilities of effective data analysis.
