Calculating the Soft Costs of Hadoop

Estimating the total cost of ownership (TCO) for Hadoop has been a challenge. Costs can be hidden or unanticipated during the planning and deployment stages. But time, and good analytics, have shown that TCO can vary between Hadoop distributions by as much as 50%. Architecture makes all the difference, particularly when it comes to soft costs: the staffing, security, and backup work of running the Hadoop environment. Choose wisely, or your cost-effective Big Data deployment may turn into a drain on money and resources.

The Cost of Expertise

The rush to deploy Hadoop is understandable: what enterprise wouldn’t want a solution that removes the restrictions on how much data a business can store, process, and analyze? Information can be collected from assorted touch points and integrated with vast amounts of previously underutilized or ignored data. New discoveries can be made from seemingly disparate data points, creating actionable information that can provide that much-desired competitive edge. Bottom line: you can never know too much about your market, and Big Data, when properly utilized, provides insights that humans simply can’t extract unaided from the unceasing flow of data.

But Hadoop can be a bit of a wild card. To get the most out of any Big Data project, and to avoid unpleasant surprises down the line, businesses should go into the Hadoop adoption process with eyes wide open. Among the most critical questions to answer is whether the right skill set to deploy, maintain, and secure Hadoop exists within the organization.

After adopting Hadoop, many companies quickly realize that they simply do not have the in-house expertise to make it run smoothly, let alone to make Hadoop enterprise-ready. At a bare minimum, for success with Hadoop, companies need an IT staff that understands block size configurations, knows what to do if they lose a NameNode, and comprehends the ins and outs of HBase and its less-than-straightforward interactions with the Hadoop Distributed File System (HDFS).

An enterprise with Hadoop wizards on staff can essentially choose any distribution and end up with an agile, robust deployment. Most businesses, though, underestimate how much time and effort can go into a Hadoop project. And if the necessary chores aren’t carried out, the deployment will cease to be useful very quickly.

In one example cited by CITO Research in the recent white paper “Five Questions to Ask Before Choosing a Hadoop Distribution,” a manufacturing company that had deployed Hadoop estimated that, shortly after deployment, it was using less than 4% of its Big Data. Due to configuration issues that could not be addressed by the company’s in-house IT, users were experiencing constant downtime during NameNode bottlenecks and upgrades. Many analysts simply opted not to use Hadoop because of these challenges, meaning that all the data in Hadoop was going to waste. The company then deployed another distribution of Hadoop and was able to utilize over 75% of its Big Data.

The Cost of Security and Backup

Before a company begins to utilize captured data in new ways, it must ensure that all personally identifiable and sensitive information is classified, and then managed and protected to safeguard privacy and security. New policies may need to be developed to address these issues. User access controls and roles may need to be redefined and implemented. Employees and executives may need to receive training. These are all costly endeavors, but the benefits of a properly deployed Hadoop distribution should more than compensate for the “start-up” costs.

Unfortunately, some deployments will demand significant investments of staff time and money, offsetting the benefits and raising TCO dramatically over time. As an example, Hadoop’s security features are generally not robust or flexible enough for enterprise use. For compliance and risk management reasons, virtually all enterprises need to utilize multiple levels of encryption: disk encryption to protect stored data, and wire-level encryption to secure data traveling between nodes in the cluster. Apache Hadoop offers neither form of encryption.

To greatly reduce TCO, enterprises will want to look for a distribution that supports disk encryption and native wire-level encryption implemented using public-private key pairs, along with the authentication methods already supported for other applications.

Enterprises also, rather obviously, need backup capabilities for their Hadoop clusters. Here again, hidden costs can arise well after deployment. The replication process in HDFS offers protection from disk failure, but it does not protect against human error. Further, if a file is corrupted, that corrupted file will be automatically replicated across the cluster, exacerbating the problem.

When you roll new code into production, you need a backup in case something goes awry and you need to roll the system back. Apache Hadoop doesn’t offer this capability. To avoid this type of exposure, and the costs in lost productivity that accompany it, businesses should consider a commercial Hadoop distribution with snapshot capabilities. Users can take point-in-time snapshots of every file and table. If an error occurs, the snapshot can be restored.
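The difference between replication and snapshots described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions: `ToyCluster` and its methods are hypothetical names invented for illustration and are not the HDFS API or any distribution's snapshot interface. It only shows why replicas survive a disk failure but faithfully copy a bad write, while a point-in-time snapshot allows a rollback.

```python
class ToyCluster:
    """Toy model only: hypothetical, NOT the HDFS API."""

    def __init__(self, replication=3):
        # Each replica is a dict of path -> contents; None marks a failed disk.
        self.replicas = [{} for _ in range(replication)]
        self.snapshots = {}

    def write(self, path, data):
        # Every write, good or bad, is copied to all live replicas.
        for replica in self.replicas:
            if replica is not None:
                replica[path] = data

    def fail_disk(self, index):
        self.replicas[index] = None

    def read(self, path):
        # Serve the file from the first live replica that holds it.
        for replica in self.replicas:
            if replica is not None and path in replica:
                return replica[path]
        raise KeyError(path)

    def snapshot(self, name):
        # Record a point-in-time copy of the current contents.
        live = next(r for r in self.replicas if r is not None)
        self.snapshots[name] = dict(live)

    def restore(self, name):
        # Roll every live replica back to the saved state.
        saved = self.snapshots[name]
        self.replicas = [None if r is None else dict(saved) for r in self.replicas]


path = "/sales/q1.csv"  # hypothetical file name
cluster = ToyCluster()
cluster.write(path, "good data")
cluster.snapshot("before-release")

cluster.fail_disk(0)                      # hardware failure
after_disk_failure = cluster.read(path)   # replication covers this case

cluster.write(path, "corrupted data")     # human error or a bad release
after_bad_write = cluster.read(path)      # every replica now holds the bad copy

cluster.restore("before-release")
after_restore = cluster.read(path)        # the snapshot brings the file back

print(after_disk_failure, "|", after_bad_write, "|", after_restore)
```

In this model the disk failure is invisible to the reader, the bad write is not, and only the snapshot undoes it.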

Your TCO and Hadoop: Calculating the Real Cost

While general TCO issues can be examined and predicted, the true TCO of any given deployment is unique to each enterprise. Skill sets, regulatory requirements, disaster recovery, and many other factors come into play.
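To make the idea concrete, here is a rough sketch of how soft costs can be folded into a side-by-side comparison. Everything in it is an assumption for illustration: the `Scenario` fields, the cost categories, and every dollar figure are hypothetical placeholders, not data from any real deployment or vendor tool. Substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One hypothetical deployment option; all fields are illustrative."""
    name: str
    hardware_per_year: float         # servers, storage, networking
    admin_staff: float               # full-time equivalents running the cluster
    cost_per_admin: float            # loaded annual cost per administrator
    downtime_hours_per_year: float   # outages, upgrades, NameNode incidents
    cost_per_downtime_hour: float    # lost productivity per hour of downtime
    security_backup_per_year: float  # add-on encryption, backup, and training


def annual_tco(s: Scenario) -> float:
    """Sum hard costs and soft costs for one year."""
    staffing = s.admin_staff * s.cost_per_admin
    downtime = s.downtime_hours_per_year * s.cost_per_downtime_hour
    return s.hardware_per_year + staffing + downtime + s.security_backup_per_year


def savings_fraction(baseline: Scenario, alternative: Scenario) -> float:
    """Fraction of the baseline's annual TCO saved by the alternative."""
    base = annual_tco(baseline)
    return (base - annual_tco(alternative)) / base


# Placeholder inputs only; identical hardware, different soft costs.
option_a = Scenario("Option A", 200_000, 4, 120_000, 100, 2_000, 120_000)
option_b = Scenario("Option B", 200_000, 2.5, 120_000, 20, 2_000, 60_000)

for option in (option_a, option_b):
    print(f"{option.name}: {annual_tco(option):,.0f} per year")
print(f"Savings: {savings_fraction(option_a, option_b):.0%}")
```

The two options use the same hardware budget on purpose: in this sketch the entire gap comes from staffing, downtime, and security and backup spending, which is where soft costs tend to hide.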

A good place to begin the evaluation (or redeployment) process is with the Hadoop TCO Calculator, which provides a personalized overview of the true costs of deploying and running various distributions of Hadoop. With this self-service tool, you use your own data and can change the inputs in real time to estimate costs across a number of variables in different scenarios. Access the Hadoop TCO Calculator here.