SmartData Collective

4 Considerations When Choosing a Hadoop Distribution

Davemendle
Last updated: 2015/06/18 at 6:44 AM

Choosing the right Hadoop distribution can be a tricky process. Many businesses looking to adopt Hadoop in their data infrastructure have a hard time figuring out what really differentiates one distribution from another. With so many options available, it’s easy to get lost in the choices.

There are four basic categories that businesses should look at for specific qualifying criteria. Let’s go through these four categories and see what considerations need to be made in order to sift through the options.

1. Performance

Many teams choose Hadoop as their data platform because it provides high performance, especially when MapReduce in the Hadoop stack is replaced with the Apache Spark data processing engine.

But not all distributions deliver a comparable level of performance, even with Apache Spark. Certain distributions scale much more easily (with less hardware), handle far larger quantities of data, and maintain higher levels of performance than others. So what should you look for as you evaluate each Hadoop distribution?

Consistent Low Latency

Many distributions advertise high performance with low latency. The dirty little secret, however, is that this latency is often unpredictable: response times can swing erratically from one request to the next. If you set up a proof of concept (POC), measure how consistent the latency is under load, not just how low it is on average.
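
One way to check consistency during a POC is to compare the average latency with the tail. The sketch below is a minimal illustration, not a benchmark tool: the sample values are made up, and `summarize_latency` is a hypothetical helper name. It uses only Python's standard `statistics` module.

```python
import statistics

def summarize_latency(samples_ms):
    """Return mean, median, 99th percentile and max for latency samples (ms)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": statistics.median(samples_ms),
        "p99": cuts[98],
        "max": max(samples_ms),
    }

# Hypothetical samples: 98 fast requests and two slow outliers
samples = [5.0] * 98 + [250.0, 400.0]
report = summarize_latency(samples)
for name, value in report.items():
    print(f"{name}: {value:.1f} ms")
```

With these made-up numbers the mean stays near 11 ms while the 99th percentile is above 250 ms, which is the kind of gap an average-only report would hide.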

Distributed Metadata

A determining factor in a Hadoop distribution’s ability to perform at an enterprise level is how its file system architecture is constructed. One key element to pay attention to is how the file system manages your metadata. When all metadata lives on a single node, that node becomes both a bottleneck and a single point of failure. The most flexible, scalable and dependable Hadoop file systems distribute the metadata among the nodes. This can provide your business with upwards of 20x the performance, and allows for high availability for those working with mission-critical applications.
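
To make the idea of distributed metadata concrete, here is a minimal sketch that assigns each file path's metadata entry to a node by hashing the path. It is an illustration of simple hash partitioning only, not the design of any particular distribution; the node names, paths, and the `metadata_owner` helper are all hypothetical, and real systems also have to handle replication and rebalancing.

```python
import hashlib

def metadata_owner(path, nodes):
    """Pick the node that holds the metadata entry for `path` (hash partitioning)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
paths = [f"/data/logs/part-{i:05d}" for i in range(1000)]  # hypothetical paths

# Count how many metadata entries each node would serve
load = {node: 0 for node in nodes}
for p in paths:
    load[metadata_owner(p, nodes)] += 1

print(load)
```

Because every node serves a share of the lookups, no single machine has to answer every metadata request.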

2. Dependability

Considering that data has become the lifeblood of many businesses today, it’s a shame that dependability hasn’t been a greater priority for data management providers. When looking for a distribution, dependability is a significant differentiator. Very few implementations provide the following dependability features, all of which should be mandatory in your list of priorities.

High Availability

High availability is a rare feature among Hadoop distributions. Only a few implementations can guarantee a system availability of 99.999%. To ensure that your provider is giving you real HA functionality, make sure that it has the following six characteristics:

  1. Self-Healing – No human intervention required to immediately solve a system failure
  2. No Downtime Upon Failure – The system keeps running no matter what
  3. Tolerate Multiple Failures – The system’s failure management capacity can be controlled by administrator preferences
  4. 100% Commodity Hardware – No commercial NAS required
  5. No Additional HA Hardware Required – System should run HA capabilities on standard commodity hardware
  6. Easy to Use – HA should be built in
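
To put the 99.999% availability figure in perspective, the arithmetic below converts an availability percentage into allowed downtime per year. This is a plain calculation, assuming a 365.25-day year; `downtime_minutes_per_year` is a hypothetical helper name.

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365.25):
    """Convert an availability percentage into allowed downtime (minutes/year)."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days_per_year * 24 * 60

five_nines = downtime_minutes_per_year(99.999)
three_nines = downtime_minutes_per_year(99.9)

print(f"99.999%: {five_nines:.2f} minutes per year")
print(f"99.9%:   {three_nines:.2f} minutes per year")
```

Five nines leaves roughly five minutes of downtime a year, while 99.9% allows close to nine hours, so the difference between the two guarantees is far larger than the percentages suggest.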

Data Protection

Protecting your data with Hadoop can be done in many ways. The most advanced distributions depend on Snapshot recovery systems. These Point-In-Time Snapshots should utilize the same storage as your live data in order to prevent slowing down your system. Additionally, they should capture files that are open (still being written) as well as files that are closed. Some versions of Hadoop Snapshot systems will only capture data that is closed; this jeopardizes the integrity of your backup data.

Disaster Recovery

Mirroring technology is the preferred method of disaster recovery for enterprise-level Hadoop users. With mirroring, your system should be able to automatically recover from a catastrophic system failure before it’s even noticed.

3. Manageability

Hadoop has evolved into a user-friendly data management system. Different implementations have done their part to optimize Hadoop’s manageability through different administrative tools. Look for a distribution that has intuitive administrative tools that assist in management, troubleshooting, job placement and monitoring.

4. Data Access

Gathering and storing your data is just the beginning of what Hadoop is all about. In order to tap into all of your data’s valuable insights, it’s important that it be easily and securely accessible. This starts with some key architectural data access foundations:

  • Full access to the Hadoop filesystem API
  • Full POSIX read/write/update access to files
  • Direct developer control over key resources
  • Secure, enterprise-grade search
  • Comprehensive data access tooling (e.g. Apache Flume, Apache Sqoop, Hive)
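
As an illustration of what POSIX read/write/update access means in practice, the sketch below overwrites a few bytes in the middle of an existing file using ordinary file I/O. It assumes the distribution exposes cluster storage as a regular mounted path; here a temporary local file stands in for that path, and `update_in_place` is a hypothetical helper name.

```python
import os
import tempfile

def update_in_place(path, offset, new_bytes):
    """Overwrite bytes at `offset` without rewriting the rest of the file."""
    with open(path, "r+b") as f:  # read/write mode, no truncation
        f.seek(offset)
        f.write(new_bytes)

# A temporary local file stands in for a file on a POSIX-accessible cluster mount.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "record.txt")
    with open(path, "wb") as f:
        f.write(b"status=PENDING;id=42\n")

    # Replace "PENDING" with "SHIPPED" (same length) in place
    update_in_place(path, len(b"status="), b"SHIPPED")

    with open(path, "rb") as f:
        result = f.read()

print(result)
```

On a file system that only supports appending to files, the same change would require writing out a new copy of the file, which is why update access is worth checking for.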

In addition to these foundational characteristics, look for security options that come enabled straight out of the box. Many administrators leave Hadoop’s security features unused because their distribution ships with those features available but not enabled. The cost and time necessary to turn them on and configure them properly can be daunting. The best option is to simply choose a distribution that has done the heavy lifting for you.

Conclusion

The world of Hadoop is getting bigger and bigger. The list of options can be overwhelming if you don’t know what you’re looking for. Hopefully these four considerations along with their specific criteria can lead you in the right direction as you search for the best Hadoop distribution for your needs.

If you’re interested in learning more about Hadoop, download the free ebook: The Executive’s Guide to Big Data and Hadoop. 

Davemendle June 18, 2015