The Beginner's Guide to Hadoop - SmartData Collective

It’s strange to think that something named after a toy elephant could become so influential within the tech community, and yet that’s exactly what has happened with Hadoop. Perhaps this shouldn’t come as a shock. In just the past few years, big data analytics has grown in popularity, with many businesses and organizations analyzing the data they collect and discovering new insights. With that increased interest in big data, Hadoop is naturally going to be involved. Though Hadoop is now better known, it can still be hard to understand for those who don’t work closely with data science. This has become a bit of a problem, especially for businesses whose top-level executives have very little expertise in big data. Despite this, there remains a definite need to get to know Hadoop. Consider this, then, a resource that beginners can turn to in order to become a bit more familiar with the platform.

Put in fairly simple terms, Hadoop is an open-source software framework for storing and processing big data. Or, as the Apache Software Foundation explains, it’s a framework that “allows for the distributed processing of large data sets across clusters of computers using simple programming models.” This definition may still be a bit too complicated for newcomers to fully grasp, though. Think of Hadoop instead as a platform that makes it easier to manage big data and run analytics on it by spreading the work across many machines. There’s no real need to get into the nitty-gritty details of how it works if you’re just starting out, so keep that admittedly basic idea in mind.

Hadoop is an offshoot of a project that originally went by the name of Nutch. The goal was to get faster web search results by distributing the needed data and calculations across multiple machines. One of the men behind the project, Doug Cutting, took the idea with him when he went to work for Yahoo, and Nutch was eventually divided into two parts. Hadoop would be the part that focused on distributed storage and processing. Cutting named the platform after his son’s toy elephant, which is why you see an elephant used as Hadoop’s mascot and on its logo. With heavy backing from Yahoo, Hadoop grew as an open-source project under the Apache Software Foundation, becoming a top-level Apache project in 2008 and giving more people than ever before the opportunity to contribute to, improve, and use the platform.

With Hadoop’s history in mind, it’s important to know how it works and what some of its benefits are. Two characteristics need to be understood to some degree. First, Hadoop can store very large amounts of data. Its storage layer, the Hadoop Distributed File System (HDFS), spreads data across many nodes, or servers, so the capacity of any single machine is no longer the limit. Need more room? Add more nodes. Second, processing is distributed in a similar manner. Hadoop’s processing model is called MapReduce, another term you’ve probably heard before. With MapReduce, data isn’t moved over a network to the software as in traditional methods. Instead, the software is sent to the nodes where the data already sits, and each node works on its own portion. Because far less data has to travel over the network, processing is a lot faster, something that most businesses can take advantage of to varying degrees.
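To make the MapReduce idea a little more concrete, here is a minimal sketch in plain Python that counts words the way a MapReduce job would. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are made up for illustration, and each string in `chunks` stands in for a piece of data that would live on a separate node in a real cluster.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run map on every chunk, then shuffle, then reduce each group."""
    pairs = []
    for chunk in chunks:  # in a real cluster, each chunk is mapped on its own node
        pairs.extend(map_phase(chunk))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data: three chunks standing in for three nodes.
chunks = ["big data big insights", "data moves less", "big clusters"]
counts = word_count(chunks)
print(counts)
```

The point of the sketch is the shape of the work: the map step only ever looks at one chunk, so it can run wherever that chunk is stored, and only the small (word, count) pairs need to be brought together afterward.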

The benefits don’t end there. In addition to increased storage and computing power, Hadoop protects against hardware failure. Because of its distributed nature, if a node or server goes down, its jobs are redirected to other nodes. Multiple copies of the data are stored on different nodes, so you won’t lose your work. Hadoop is also relatively easy to scale and flexible enough to handle many different types of data, from videos, images, and text to more structured sources. That’s not to mention its relatively low cost, since the software is open source.
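The idea of keeping several copies of the data can be sketched in a few lines of Python. This is a simplified illustration, not how Hadoop actually decides where copies go: the round-robin placement, the node names, and the helper names (`place_blocks`, `readable_blocks`) are all assumptions made for the example. The one detail taken from Hadoop itself is the default of three copies per block in HDFS, Hadoop’s storage layer.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified on purpose: real HDFS placement also considers racks
    and how full each node is.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one copy on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


# Hypothetical four-node cluster holding six blocks of data.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(6, nodes, replication=3)

# Losing a single node leaves every block readable from another copy.
survivors = readable_blocks(placement, failed={"node-b"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

In this toy setup a block is only lost if every node holding one of its copies fails at the same time, which is why a single server going down doesn’t cost you any data.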

That’s not to say Hadoop is without challenges. Security remains an issue, though there are ways to address the problems. Hadoop is also very complex, and there’s a noticeable shortage of people who can actually use it well. Even with these challenges, Hadoop provides tremendous opportunities to those who want to use it for big data analytics. Many large companies, such as Yahoo, Facebook, and IBM, use Hadoop, and the potential is there for many more to do so. Hopefully this beginner’s guide can get you started on that path.