Vendor Lock-in and the Big Data Ecosystem — What Does it Really Mean?

January 20, 2016



I read a lot of propaganda these days. Not in a national or international political sense, but within the business world. Here’s an example: certain companies would have you believe that when you purchase a license for software that is backed by an enterprise guarantee and delivers enterprise quality, you will be locked in to that vendor. That is as far from the truth as you can get, so let’s look at why.

Examples of Lock-in

There are multiple ways a vendor can lock you into their technology. Let’s say that you need to implement a business intelligence (BI) platform to build reports and dashboards to support your business needs. In general, as soon as reports have been built, visualizations have been created, and you are relying on that BI tool to support your daily activities, you are locked in. You cannot easily switch to another BI vendor, because the tooling of the interface, the types of visualizations, and even the dashboards are not generally based on any accepted industry standards. A BI tool may connect to many data sources, but the tool itself is not based on an agreed-upon standard. This is a type of vendor lock-in.

Relational database vendors have typically been considered a source of vendor lock-in. In certain cases this is true, but not always. Each relational database vendor does in fact implement standards, such as SQL. They also implement and support non-standard extensions. Sometimes users of those systems find great value in using both the standard and the non-standard features. However, as soon as a user depends on those non-standard features, they are somewhat locked into that vendor’s product.
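To make the difference between standard and non-standard features concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The orders table and its columns are made up for illustration. COALESCE is part of the SQL standard, while IFNULL is an extension that only some databases (SQLite among them) provide. Both queries return the same rows here, but only the first is likely to run unchanged on another vendor’s database.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, discount REAL)")
conn.executemany(
    "INSERT INTO orders (id, discount) VALUES (?, ?)",
    [(1, 0.1), (2, None)],
)

# Standard SQL: COALESCE is defined by the SQL standard.
PORTABLE = "SELECT id, COALESCE(discount, 0.0) FROM orders ORDER BY id"

# Non-standard: IFNULL is a vendor extension (available in SQLite, but not everywhere).
VENDOR_SPECIFIC = "SELECT id, IFNULL(discount, 0.0) FROM orders ORDER BY id"

portable_rows = conn.execute(PORTABLE).fetchall()
vendor_rows = conn.execute(VENDOR_SPECIFIC).fetchall()

# Same answer today; the dependency only shows up when you try to switch databases.
print(portable_rows == vendor_rows)
```

The more queries an application accumulates in the second style, the more rewriting a move to another database requires.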

What these examples should help you understand is that there are cases where you will have no choice but to accept some form of vendor lock-in. There are other cases where a vendor gives you the choice of whether to be locked in. Choosing to lock yourself into a vendor is, well, your own choice.

No Lock-in Model

Now that we have examined BI tools and relational databases as examples, let’s discuss the Hadoop ecosystem and vendor lock-in.


Apache Hadoop provides a standard (accepted) set of APIs to interact with the Hadoop platform. Any software built to support those APIs can play in the ecosystem. The Hadoop Distributed File System (HDFS) is an API first and an implementation second. This means that anyone can implement and support the HDFS API atop their own file system. Among the big three names in Hadoop (formerly the big five, as Intel and EMC have exited the market), none of the three will support their competitors’ distributions. The question this raises for me is “does vendor lock-in within Hadoop look like vendor lock-in within enterprise relational databases?” I say no, and here is why:

  1. Hadoop is driven by the Apache Software Foundation, which is not vendor driven.
  2. Hadoop implements APIs like HDFS, HBase and others that anyone is free to implement.
  3. Other industry standards exist that can be implemented atop Hadoop if a vendor so chooses.

Given these details, let’s take a moment to compare the Hadoop ecosystem to the relational database ecosystem. If you store a petabyte of data in a relational database and are not using the proprietary extensions offered by that database vendor, no one will say you are locked in to that vendor. They may, however, say “good luck migrating off of it,” because that is a lot of data to export from one system into another. If you store a petabyte of data in a Hadoop system, you can migrate to another Hadoop vendor using standard tools like distcp, which copies your data from one Hadoop cluster to another. If you have software written against the APIs provided by the Hadoop ecosystem, like HDFS or HBase, you can move to another vendor at your discretion. Will it be difficult to move your data? That depends on the volume of data you have, but you can still move to another vendor.
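The idea of writing software against an API rather than against one vendor’s implementation can be sketched in a few lines of Python. Everything below is hypothetical: FileStore, VendorAStore, VendorBStore, save_report and migrate are made-up names for illustration, not the real HDFS API, and migrate is only a toy stand-in for a copy step, not a description of how distcp works. The point is simply that the application code never names a vendor.

```python
from typing import Dict, Iterable, Protocol, Tuple


class FileStore(Protocol):
    """Hypothetical storage API that the application codes against."""

    def write(self, path: str, data: bytes) -> None: ...
    def read(self, path: str) -> bytes: ...


class VendorAStore:
    """Made-up vendor implementation: keeps files in a flat dict keyed by path."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]


class VendorBStore:
    """Second made-up vendor: different internals, same public API."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[str, ...], bytes] = {}

    @staticmethod
    def _key(path: str) -> Tuple[str, ...]:
        return tuple(path.strip("/").split("/"))

    def write(self, path: str, data: bytes) -> None:
        self._blocks[self._key(path)] = data

    def read(self, path: str) -> bytes:
        return self._blocks[self._key(path)]


def save_report(store: FileStore, name: str, body: str) -> str:
    """Application code: depends only on the FileStore API, not on a vendor."""
    path = f"/reports/{name}.txt"
    store.write(path, body.encode("utf-8"))
    return path


def migrate(source: FileStore, target: FileStore, paths: Iterable[str]) -> None:
    """Toy copy step: read each path from the source and write it to the target."""
    for path in paths:
        target.write(path, source.read(path))


old_cluster, new_cluster = VendorAStore(), VendorBStore()
report_path = save_report(old_cluster, "q1", "revenue up")
migrate(old_cluster, new_cluster, [report_path])
moved = new_cluster.read(report_path).decode("utf-8")
```

After the move, save_report works against new_cluster without any changes, which is the property the comparison above relies on.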

Comparison to Cloud Vendors

These days, running a data center in the cloud, or both on-premises and in the cloud (sometimes referred to as a hybrid model), is an option for many companies. Cloud vendors are notorious for vendor lock-in. If running some part of your data center in the cloud is an option, then there is a comparison here that is worth noting. Vendor lock-in becomes very apparent with cloud providers like Amazon, with its proprietary APIs and services; Amazon Redshift is one example. Google Cloud Bigtable, on the other hand, implements the HBase API, which helps prevent vendor lock-in, just like MapR-DB. If this is a topic of interest to you, there are more details in an article published on Fierce CIO about choosing a cloud vendor.


The only lock-in that exists within the Hadoop ecosystem is the lock-in you create for yourself. By that, I mean that some vendors provide tools to make your life easier. If one of the vendors offers you administrative capabilities that prevent you from having to do a lot of extra work, then you are “locking” yourself into a better way of life. You could always leave that vendor at your own discretion, and know that you will have to spend more time on those same tasks. Paying for simplicity is not vendor lock-in.

The bottom line is that you shouldn’t believe the hype, or the fear, uncertainty and doubt (FUD), in marketing materials from certain companies that have no marketable differentiation; they just want you to think that other vendors lock you in when it just isn’t true.

If you are leveraging software within the Hadoop ecosystem, understand that vendor lock-in is a scare tactic that is employed by certain vendors. They do NOT want you to understand the details I have shared with you in this article. Share this knowledge with others; it’s important to help everyone understand what vendor lock-in means within the Hadoop ecosystem. Keep the power on your side.