SmartData Collective
© 2008-23 SmartData Collective. All Rights Reserved.

What is Hadoop?

TonyBain
Last updated: 2009/01/24 at 6:44 AM


OK, so you are setting out to build the next Google and are considering a Map/Reduce-based data access strategy over traditional SQL.  Just as you need a database server to process SQL queries, you also need underlying infrastructure to manage your data and to execute your Map/Reduce routines.  Hadoop is one such system that is gaining acceptance; it is being co-developed and used for data analytics at Yahoo and Facebook, amongst others.
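If the Map/Reduce model is new to you, the following is a minimal single-process sketch of the idea in plain Python. It is not Hadoop's API; the function names and the two-field log format are made up for illustration. A map step emits key/value pairs, a shuffle groups them by key, and a reduce step combines each group.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (path, 1) for each log line of the assumed form 'METHOD PATH'."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input, not taken from any real log.
sample_log = [
    "GET /index.html",
    "GET /about.html",
    "GET /index.html",
]
hits = run_job(sample_log)
```

In SQL terms this job is roughly a `SELECT path, COUNT(*) ... GROUP BY path`. The difference is that the map and reduce steps can be handed out to many machines, each working on its own slice of the input.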

Hadoop is a system that distributes unstructured data across hundreds or thousands of machines forming shared-nothing clusters, and runs Map/Reduce routines on the data in that cluster.  Hadoop has its own filesystem, which replicates data to multiple nodes so that if one node holding data goes down, there are at least 2 other nodes from which to retrieve that piece of information.  This protects data availability against node failure, something that is critical when there are many nodes in a cluster (think of it as RAID at the server level).
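To make the replication idea concrete, here is a toy sketch in Python. It assumes three copies of every block and a simple round-robin placement; Hadoop's actual placement policy is more involved, and the node and block names below are hypothetical.

```python
REPLICATION_FACTOR = 3  # assumption: three copies of every block

def place_blocks(block_ids, nodes, replicas=REPLICATION_FACTOR):
    """Assign each block to `replicas` distinct nodes, round-robin (illustrative only)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_blocks(["blk-0", "blk-1", "blk-2", "blk-3"], nodes)

# Losing one node leaves every block readable from its other two replicas.
survivors = readable_blocks(placement, failed_nodes={"node-b"})
```

Because each block lives on three distinct nodes, every block in this sketch stays readable even if any two nodes fail at once.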

So will Hadoop outperform an RDBMS?  Unless you are dealing with very large volumes of unstructured data (hundreds of GB, TBs or PBs) and have large numbers of machines available, you will likely find a Map/Reduce query on Hadoop much slower than a comparable SQL query on a relational database.  Hadoop uses a brute-force access method, whereas RDBMSs have optimization methods for accessing data, such as indexes and read-ahead.  The benefits only come into play when mass parallelism is achieved, or when the data is unstructured to the point where no RDBMS optimizations can be applied to help query performance.  Indeed, benchmarks from the Hadoop site show Hadoop significantly slower than a relational DB in straight-line query performance on small-scale tests.
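The gap between brute-force access and indexed access can be sketched in a few lines of Python. This is only an analogy: the dictionary below stands in for a database index (a real RDBMS would typically use a B-tree), and the rows are generated, hypothetical data.

```python
# Hypothetical data set: 100,000 rows spread over 1,000 distinct paths.
rows = [{"id": i, "path": f"/page/{i % 1000}"} for i in range(100_000)]

def scan(rows, path):
    """Brute force: examine every row, the way a Map/Reduce job reads all of its input."""
    return [r["id"] for r in rows if r["path"] == path]

def build_index(rows):
    """Build a lookup structure once, roughly what an RDBMS does when it maintains an index."""
    index = {}
    for r in rows:
        index.setdefault(r["path"], []).append(r["id"])
    return index

index = build_index(rows)

via_scan = scan(rows, "/page/42")        # touches all 100,000 rows
via_index = index.get("/page/42", [])    # jumps straight to the matching ids
```

Both approaches return the same ids. The scan pays for every row on every query, while the index pays once up front and then answers each lookup cheaply, which is the advantage the RDBMS has on a single machine.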

 
           MySql 5.0.27                  Hadoop-0.15.2
Data       B-tree disk table (MyISAM)    Text files (access_log)
Machine    1                             2
Rows       5,914,669                     5,914,669
Results    100                           100
Time       4.43 sec                      172.30 sec

But as with all benchmarks, everything has to be taken into consideration.  For example, if the data starts life as a text file in the file system (e.g. a log file), the cost of extracting that data from the text file, structuring it into a standard schema and loading it into the RDBMS has to be considered.  If you have to do that for 1,000 or 10,000 log files, it may take minutes, hours or days (with Hadoop you still have to copy the files to its file system).  In some environments it may also be practically impossible to load such data into an RDBMS, because data is generated in such volume that the load process cannot keep up.  So while your query time may be slower with Hadoop (speed improves with more nodes in the cluster), your access time to the data may potentially be improved.
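Here is a rough sketch of the extract, structure and load step that the RDBMS route requires before the first query can run. It uses Python's built-in `sqlite3` as a stand-in for the RDBMS; the regular expression, table layout and sample lines are simplified assumptions, not the format used in the benchmark.

```python
import re
import sqlite3

# Simplified, hypothetical pattern for a common-log-style line.
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)'
)

def parse_line(line):
    """Turn one raw log line into a structured row, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    host, timestamp, method, path, status, size = m.groups()
    return host, timestamp, method, path, int(status), 0 if size == "-" else int(size)

def load(lines):
    """Create the schema and load every parseable line into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE access_log "
        "(host TEXT, ts TEXT, method TEXT, path TEXT, status INTEGER, bytes INTEGER)"
    )
    parsed = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?, ?, ?)", parsed)
    conn.commit()
    return conn

# Made-up sample lines for illustration.
sample = [
    '10.0.0.1 - - [24/Jan/2009:06:44:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [24/Jan/2009:06:44:01 +0000] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]
conn = load(sample)
counts = conn.execute(
    "SELECT status, COUNT(*) FROM access_log GROUP BY status ORDER BY status"
).fetchall()
```

All of the parsing and inserting happens before the SQL query can return anything, and it has to be repeated for every new file. A Map/Reduce job skips this step and reads the raw text directly, at the price of a slower query.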

Also, as there aren’t any mainstream RDBMSs that scale to thousands of nodes, at some point the sheer mass of brute-force processing power will outperform the optimized, but scale-restricted, relational access methods.
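As a back-of-the-envelope illustration using the benchmark figures quoted above, the snippet below estimates how many nodes it would take for brute force to catch up. It assumes perfectly linear scaling and no fixed per-job overhead, neither of which holds in practice (job startup costs dominate on small data sets), so treat the result as a rough lower bound and not a prediction. The helper name is made up for this sketch.

```python
import math

def nodes_to_match(hadoop_seconds, hadoop_nodes, rdbms_seconds):
    """Nodes needed to match the RDBMS time, assuming ideal linear scaling (an assumption)."""
    total_work = hadoop_seconds * hadoop_nodes  # node-seconds of brute-force work
    return math.ceil(total_work / rdbms_seconds)

# Figures from the benchmark: 172.30 sec on 2 machines vs 4.43 sec on 1 machine.
slowdown = 172.30 / 4.43
break_even = nodes_to_match(172.30, 2, 4.43)
```

Under these idealized assumptions, Hadoop is about 39 times slower in the two-machine test and would need on the order of 78 nodes to draw level on this particular data set. The point is not the exact number but the shape of the trade-off: the relational side cannot keep adding nodes, while the brute-force side can.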

So while Hadoop and Map/Reduce are gaining popularity, they shouldn’t be considered a like-for-like alternative to an RDBMS for most applications.  Hadoop is a specialized tool with a specialized set of criteria that need to be fulfilled to achieve a benefit over more traditional approaches.

Link to original post: Innovations in information management