4 Considerations When Choosing a Hadoop Distribution

Choosing the right Hadoop distribution can be a tricky process. Many businesses looking to adopt Hadoop in their data infrastructure have a hard time figuring out what really differentiates one distribution from another. With so many options available, it’s easy to get lost in the choices.

Businesses should evaluate distributions against four basic categories, each with its own qualifying criteria. Let’s go through these four categories and see what to consider in order to sift through the options.

1. Performance

Hadoop is the data platform chosen by many because it provides high performance – especially if you replace MapReduce in the Hadoop stack with the Apache Spark data processing engine.

But not all distributions deliver a comparable level of performance, even with Apache Spark. Certain distributions can scale much more easily (with less hardware), handle far larger quantities of data, and maintain higher levels of performance than others. So what should you look for as you evaluate each Hadoop distribution?

Consistent Low Latency

Many distributions advertise high performance with low latency. The dirty little secret, however, is that this latency is often unpredictable: it can swing erratically from one request to the next. If you set up a proof of concept (POC), check how consistent the latency is, not just how low it is on average.
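
One way to check consistency during a POC is to compare the median latency with the tail. The sketch below is only an illustration: `measure_latencies` and `summarize` are hypothetical helper names, and `operation` stands in for whatever read or query you actually run against the cluster. A large gap between the p99 and the median suggests unpredictable latency.

```python
import statistics
import time


def measure_latencies(operation, runs=200):
    """Time `operation` repeatedly and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Report the median, tail percentiles, and the tail-to-median ratio."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "p99_ms": p99,
        "max_ms": max(latencies_ms),
        "p99_over_p50": p99 / p50 if p50 > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Placeholder workload (an assumption); replace it with a real cluster call.
    report = summarize(measure_latencies(lambda: sum(range(10_000))))
    for name, value in report.items():
        print(f"{name}: {value:.3f}")
```

Running the same measurement at different times of day and under different loads gives a better picture than a single run.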

Distributed Metadata

A determining factor in a Hadoop distribution’s ability to perform at an enterprise level is how its file system is architected. One key element to pay attention to is how the file system manages your metadata. The most flexible, scalable and dependable Hadoop file systems distribute the metadata among the nodes. This can provide your business with upwards of 20x the performance, and it allows for high availability for those working with mission-critical applications.

2. Dependability

Considering that data has become the lifeblood of many businesses today, it’s a shame that dependability hasn’t been a greater priority for data management providers. When looking for a distribution, dependability is a significant differentiator. Very few implementations provide the following dependability features, all of which should be mandatory in your list of priorities.

High Availability

High availability (HA) is a rare feature among Hadoop distributions. Only a few implementations can guarantee a system availability of 99.999%. To ensure that your provider is giving you real HA functionality, make sure its distribution has the following six characteristics:

Self-Healing – No human intervention required to immediately solve a system failure
No Downtime Upon Failure – The system keeps running no matter what
Tolerate Multiple Failures – The system’s failure management capacity can be controlled by administrator preferences
100% Commodity Hardware – No commercial NAS required
No Additional HA Hardware Required – System should run HA capabilities on standard commodity hardware
Easy to Use – HA should be built in
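
To put the 99.999% availability figure in perspective, it helps to convert it into a yearly downtime budget. The short calculation below is just arithmetic; the function name is made up for illustration, and it assumes an average year of 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

At 99.999%, the budget works out to a little over five minutes of downtime per year, which is why recovery has to be automatic rather than dependent on an administrator.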

Data Protection

Protecting your data with Hadoop can be done in many ways. The most advanced distributions depend on Snapshot recovery systems. These Point-In-Time Snapshots should utilize the same storage as your live data in order to prevent slowing down your system. Additionally, they should capture data in both open and closed files. Some Hadoop Snapshot systems only capture data in closed files; this jeopardizes the integrity of your backup data.

Disaster Recovery

Mirroring technology is the preferred method of disaster recovery for enterprise-level Hadoop users. With mirroring, your system should be able to automatically recover from a catastrophic system failure before it’s even noticed.

3. Manageability

Hadoop has evolved into a user-friendly data management system. Each implementation has done its part to improve Hadoop’s manageability through its own administrative tools. Look for a distribution with intuitive administrative tools that assist in management, troubleshooting, job placement and monitoring.

4. Data Access

Gathering and storing your data is just the beginning of what Hadoop is all about. In order to tap into all of your data’s valuable insights, it’s important that it be easily and securely accessible. This starts with some key architectural data access foundations:

Full access to the Hadoop filesystem API
Full POSIX read/write/update access to files
Direct developer control over key resources
Secure, enterprise-grade search
Comprehensive data access tooling (e.g. Apache Flume, Apache Sqoop, Hive)
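
To make the POSIX read/write/update item concrete, here is a minimal sketch of what update access means in practice: changing bytes in the middle of an existing file with ordinary file calls, rather than rewriting the whole file. It assumes the cluster file system is exposed as a regular mount point; the temporary directory below is only a stand-in so the example runs anywhere, and `update_in_place` is a hypothetical helper name.

```python
import os
import tempfile


def update_in_place(path, offset, new_bytes):
    """Overwrite bytes at `offset` without rewriting the whole file."""
    with open(path, "r+b") as f:  # read/write mode, no truncation
        f.seek(offset)
        f.write(new_bytes)


# Stand-in for a cluster mount point (an assumption); a real path is
# deployment-specific.
with tempfile.TemporaryDirectory() as mount_point:
    path = os.path.join(mount_point, "events.log")
    with open(path, "wb") as f:
        f.write(b"status=PENDING\n")

    # Replace the 7-byte value that starts right after "status=".
    update_in_place(path, offset=7, new_bytes=b"DONE   ")

    with open(path, "rb") as f:
        result = f.read()
    print(result)
```

If a distribution supports this kind of access, existing tools and scripts that expect a normal file system can work against cluster data without being rewritten for a Hadoop-specific API.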

In addition to these foundational characteristics, look for security options that come enabled straight out of the box. Hadoop security features are commonly available but left unused, because the distribution ships with them disabled and administrators never turn them on. The cost and time necessary to configure these features fully can be daunting. The best option is to simply choose a distribution that has done the heavy lifting for you.

Conclusion

The world of Hadoop is getting bigger and bigger. The list of options can be overwhelming if you don’t know what you’re looking for. Hopefully these four considerations along with their specific criteria can lead you in the right direction as you search for the best Hadoop distribution for your needs.
