
A Closer Look at RDDs

kingmesal
Last updated: 2015/11/20 at 10:12 AM

Apache Spark has gotten a lot of attention for its fast processing of large amounts of data. But how does it get up to speed? The biggest reason that Spark is so fast is its use of the Resilient Distributed Dataset, or RDD.

These RDDs are the heart of Spark—the engine that allows it to process so much data so quickly and reliably.

What Are RDDs?

RDDs are a data structure distinctive to Apache Spark. The “resilient” part comes from the fact that Spark can rebuild data that is lost when a machine fails. Operations on RDDs are split into two categories: transformations and actions. (These are explained later in the article.) Work on the data is expressed as chains of transformations, which makes it easy to recover from machine loss.

Since MapR made its name as a company supporting Hadoop, it’s not surprising that distributed filesystems are a major focus for Spark. RDDs are split into partitions that can be distributed across many nodes, coordinated by a “master” node. If a “slave” machine goes down, the data it held can be reconstructed on other nodes, a little like a giant RAID system. Hadoop comes in by providing HDFS, a reliable distributed filesystem, to store the data.

RDDs also hold as much data as possible in memory, which avoids repeated trips to disk. On a cluster with well-provisioned nodes, most working sets should fit without any problem.

Lazy Evaluation

You might think that keeping all of this data in memory and chaining many operations over it would be as slow as molasses, but Spark borrows a technique from functional programming languages like Haskell: lazy evaluation. Transformations on RDDs are not executed when you declare them; they run only when an action asks for a result. This lets you specify complicated chains of operations without paying for intermediate results you never use.

You could work out an algorithm in the interactive Spark shell, do something else, then ask for some output. If you find that it takes a long time, you can put it into a script to be run in batch mode.

This is in contrast to eager evaluation, where everything’s evaluated immediately.
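The difference can be sketched in plain Python. This is a toy model, not Spark's API: the class `LazyDataset` and the helper `noisy_square` are hypothetical names invented for illustration. Each transformation only records what should happen, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records operations, runs them only on demand."""

    def __init__(self, source, ops=()):
        self._source = source    # the original data (a plain list here)
        self._ops = tuple(ops)   # pending operations, not yet executed

    def map(self, fn):
        # Transformation: nothing runs; we only remember what to do later.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: likewise just extends the recorded plan.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: the recorded operations are executed now, in order.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def noisy_square(x):
    calls.append(x)  # record every real invocation
    return x * x

pipeline = LazyDataset([1, 2, 3, 4]).map(noisy_square).filter(lambda v: v > 4)
calls_before_action = len(calls)  # nothing has been computed yet
result = pipeline.collect()       # the action triggers the work
```

Building `pipeline` costs almost nothing because `noisy_square` is never called until `collect` runs. An eager version would square every element the moment `map` was called.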

Transformations vs. Actions

Transformations make changes to data, such as sorting, mapping, or filtering items that meet certain criteria. Each transformation returns a new RDD from the old one nondestructively. In case of a failure, an RDD can be rebuilt by replaying the chain of transformations that produced it. Transformations include filtering, mapping, and reducing by key, as well as operations derived from set theory, such as unions and intersections.

The Spark documentation shows all of the transformations you can make to your data. You can build all kinds of complex transformations to explore your data out of these simple building blocks.

Actions are operations that produce some kind of output. Actions include counting the elements in an RDD, collecting them into an array, reducing them to a single value, and retrieving the first N items. You can also save an RDD as a text file for exporting.
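A minimal sketch of the split between transformations and actions, again in plain Python rather than Spark itself. `ToyRDD` is a hypothetical class, and to stay short it evaluates eagerly, which real RDDs do not. The point is only that transformations return a new dataset and leave the original untouched, while actions return ordinary values.

```python
class ToyRDD:
    """Illustrative only: an immutable wrapper around a tuple of items."""

    def __init__(self, items):
        self._items = tuple(items)

    # Transformations: each returns a NEW ToyRDD; the original is untouched.
    def map(self, fn):
        return ToyRDD(fn(x) for x in self._items)

    def filter(self, pred):
        return ToyRDD(x for x in self._items if pred(x))

    def union(self, other):
        # Keeps duplicates, as a plain concatenation.
        return ToyRDD(self._items + other._items)

    def intersection(self, other):
        # Deduplicated; sorted here only to make the toy output predictable.
        keep = set(other._items)
        return ToyRDD(sorted({x for x in self._items if x in keep}))

    # Actions: each returns an ordinary Python value.
    def count(self):
        return len(self._items)

    def take(self, n):
        return list(self._items[:n])

    def collect(self):
        return list(self._items)


evens = ToyRDD([2, 4, 6, 8])
small = ToyRDD([1, 2, 3, 4])

doubled = small.map(lambda x: x * 2)  # new dataset; `small` is unchanged
both = evens.intersection(small)      # set-style transformation
combined = evens.union(small)         # another new dataset

total = combined.count()              # action: returns a plain int
first_three = combined.take(3)        # action: returns a plain list
```

In Spark the corresponding method names (`map`, `filter`, `union`, `intersection`, `count`, `take`, `collect`) exist on RDDs, but they operate on partitions spread across a cluster rather than on a local tuple.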

Lineage

A lot of the resiliency in RDDs comes from lineage. As mentioned earlier, transformations are nondestructive, preserving the earlier RDD. For each RDD, Spark keeps a lineage: a record of the parent RDDs and the transformations it was derived from. Because every RDD is defined by transformations on earlier ones, if a node fails, the master can have the lost data recomputed on other nodes from the lineage.
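The recovery idea can be illustrated with a small, self-contained Python model. Everything here is an assumption made for illustration: `LineageRDD`, its `cache` list, and the way a lost partition is marked with `None` are hypothetical and do not reflect Spark's internals. Each dataset remembers its parent and the step that produced it, so a missing partition can be rebuilt by replaying those steps.

```python
class LineageRDD:
    """Toy model: each dataset remembers its parent and the step that made it."""

    def __init__(self, partitions=None, parent=None, step=None, label="source"):
        self.parent = parent
        self.step = step    # function applied to one parent partition
        self.label = label
        if parent is None:
            self.cache = [list(p) for p in partitions]  # source data
            self.num_partitions = len(self.cache)
        else:
            self.cache = None                           # nothing computed yet
            self.num_partitions = parent.num_partitions

    def map(self, fn, label="map"):
        return LineageRDD(parent=self, label=label,
                          step=lambda part: [fn(x) for x in part])

    def filter(self, pred, label="filter"):
        return LineageRDD(parent=self, label=label,
                          step=lambda part: [x for x in part if pred(x)])

    def lineage(self):
        # Walk back to the source, then report the chain oldest-first.
        node, chain = self, []
        while node is not None:
            chain.append(node.label)
            node = node.parent
        return list(reversed(chain))

    def compute(self, index):
        """Return partition `index`, rebuilding it from the lineage if needed."""
        if self.cache is not None and self.cache[index] is not None:
            return self.cache[index]
        if self.parent is None:
            raise RuntimeError("source partition lost; lineage cannot rebuild it")
        rebuilt = self.step(self.parent.compute(index))
        if self.cache is not None:
            self.cache[index] = rebuilt  # store the recovered partition
        return rebuilt

    def materialize(self):
        self.cache = [self.compute(i) for i in range(self.num_partitions)]


source = LineageRDD(partitions=[[1, 2, 3], [4, 5, 6]])
even_squares = (source.map(lambda x: x * x, label="square")
                      .filter(lambda x: x % 2 == 0, label="keep even"))
even_squares.materialize()
before = list(even_squares.cache)

even_squares.cache[1] = None          # simulate losing the node holding partition 1
recovered = even_squares.compute(1)   # rebuilt by replaying "square" then "keep even"
chain = even_squares.lineage()
```

Only the lost partition is recomputed; partition 0 is served from the cache. The sketch also shows the limit of the approach: if the source data itself is gone, there is nothing left to replay.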

Of course, that assumes that nothing bad has happened to the driver application, whether it is running in client or cluster mode. In any case, lineage is very powerful and provides a great degree of safety, but you should still follow good practices such as keeping up-to-date backups and having an off-site backup plan for data that really matters. If you can afford it, you might want to build clusters in another area for continuity. These days, it’s also easy to use cloud providers to have clusters in different places without having to wire up servers yourself.

Conclusion

The heart of a good program is, as any really talented programmer will tell you, the data structures, rather than the algorithm. RDDs provide a good data structure for Big Data because of their reliability combined with their simplicity. With Spark and RDDs, the only limit will be your imagination, especially when backed by the support of MapR’s Spark distribution.

Take a deeper dive into Spark with the free interactive eBook: Getting Started with Spark: From Inception to Production, by James A. Scott.

