What Does Taming Big Data Really Cost?

September 18, 2013

I’ve written in the past about the hype that surrounds big data. Perhaps one of the most ridiculous urban legends out there today is that working with big data is cheap. This myth is largely driven by the fact that some big data tools, such as Hadoop, don’t come with a license fee.

Before I get started, let me be clear on one thing. Open source products such as Hadoop have many uses and can play important roles in the development of analytic processes. Plus, we stats geeks love to download useful software for free just like anyone else. Who doesn’t want free stuff? The key is to look at what it will cost in total to actually use that software in the context in which you plan to use it. If you aren’t careful to take into account the entire spectrum of costs for your analytic platform, you might just find yourself going way over your planned budget.

Many Price Is Right winners have learned the hard way that their “free” RV came with a large tax bill and heavy ongoing maintenance and operational costs. If you can’t handle those costs, then winning an RV isn’t as exciting as it seems. Similarly, getting a free puppy leads to a lot of ongoing work, as well as costs for food, vets, and myriad other items. It isn’t that winning an RV, or getting a puppy, or deciding to leverage Hadoop is a bad thing. It is simply a matter of being sure you understand what you’re committing to.

Look Beyond License Fees

Regardless of what you pay for your software, license fees are but the beginning of the overall cost of operation for an analytic platform. Other costs include:

  • The hardware that the software will be installed on
  • The space taken and power used by the hardware
  • Configuring & implementing security, resource prioritization, and other operational features
  • Acquiring, loading, and making data ready for use
  • Developing analytic processes on your platform
  • The latency between requests and results being delivered
  • Maintaining the platform
  • Training staff to use and configure the platform
  • Consultants needed for any part of implementation

Certainly there are other costs, but the point should be clear. You can’t simply look at any one factor when determining your costs and deciding which direction to go. A simple comparison of license fees or cost per server won’t lead you to the right decision. It is also important to look at total cost over the expected life of the platform. Saving on license fees can lower first-year costs, but if ongoing costs are higher, those savings can be more than offset over time.
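
To make that arithmetic concrete, here is a minimal sketch in Python. The names PlatformCosts and total_cost, the split into upfront and annual costs, and every dollar figure are hypothetical placeholders of mine, chosen only to show how a first-year saving can reverse over the life of a platform. They are not estimates for Hadoop or for any commercial product.

```python
from dataclasses import dataclass


@dataclass
class PlatformCosts:
    """Simplified cost profile for one platform (all figures are made up)."""
    name: str
    upfront: float  # one-time costs: licenses, hardware, setup, consultants
    annual: float   # recurring costs: staff hours, maintenance, power, training


def total_cost(platform: PlatformCosts, years: int) -> float:
    """Total cost over the expected life of the platform, in years."""
    return platform.upfront + platform.annual * years


# Hypothetical numbers, for illustration only.
no_license_fee = PlatformCosts("no license fee", upfront=200_000, annual=500_000)
licensed = PlatformCosts("licensed", upfront=800_000, annual=300_000)

for years in (1, 3, 5):
    a = total_cost(no_license_fee, years)
    b = total_cost(licensed, years)
    print(f"{years} yr: {no_license_fee.name} = {a:,.0f}, {licensed.name} = {b:,.0f}")
```

With these invented inputs, the option with no license fee is cheaper in year one, the two break even at three years, and the ranking flips by year five. Plug in your own estimates for each line item above. The outcome can go either way, which is exactly why the comparison has to cover the full life of the platform.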

I believe that the most often underestimated line items relate to the man-hours required from employees or consultants to actually stand up, configure, utilize, and maintain an environment. It is absolutely critical to account for these costs. This is one area, for example, where Hadoop can run up your bill. It is a newer, maturing technology that few people are yet familiar with. This leads to a steep learning curve. Again, this isn’t to say the learning curve can’t be worth it. I am simply saying that you have to recognize and account for it.

One of the best discussions I have seen on this topic is a recent paper from Richard Winter at Winter Corp. I encourage you to download and read his report. In it, he develops a framework and provides some examples that lead to very different conclusions on where to invest based on the profile of the data being targeted and the analytics required. He shows how in some cases, Hadoop is a solid fit and in others it doesn’t fit well at all. That’s also true with other technologies. As always, it comes down to what you need to do.

As Teradata states with its Unified Data Architecture, there is a place for Hadoop and other open source products. There is also a place for commercial technologies such as Teradata. Each has a unique total cost profile based upon the type and volume of processing required. By focusing on your total costs, you’ll be better able to allocate your resources over time and you’ll ensure that the right tools and technologies are being used in the right ways.

If you’re reading this, your action item is to make sure your organization is considering all the costs it will incur over time in order to make a solid, appropriate decision as to what tools and technologies serve what roles in your environment. Otherwise, you’ll make it harder for your organization to tame big data.

Originally published by the International Institute for Analytics