SmartData Collective

A Closer Look at RDDs

kingmesal

Apache Spark has gotten a lot of attention for its fast processing of large amounts of data. But how does it get up to speed? The biggest reason Spark is so fast is its use of the Resilient Distributed Dataset, or RDD.

RDDs are the heart of Spark: the engine that allows it to process so much data so quickly and reliably.

What Are RDDs?

RDDs are a data structure distinctive to Apache Spark. The “resilient” part comes from the way Spark protects data against failures. Operations on RDDs fall into two categories, transformations and actions, both explained later in this article. Work on the data is expressed as a chain of transformations, which makes it easy to recover when a machine is lost.

Since MapR made its name as a company supporting Hadoop, it’s not surprising that distributed filesystems are a major focus for Spark. The “distributed” part means that RDDs can be spread across many nodes, coordinated by a “master” node. If a “slave” machine goes down, its share of the data can be rebuilt on other nodes, loosely like a giant RAID system. Hadoop comes in by providing the reliable, distributed HDFS filesystem to store the data.

RDDs also hold as much data as possible in memory. With well-provisioned nodes, most clusters shouldn’t have any trouble keeping their working data there.

Lazy Evaluation

You might think that juggling all of this data in memory would be as slow as molasses, but Spark borrows a technique from functional programming languages like Haskell: lazy evaluation. Transformations on RDDs aren’t actually executed until an action needs their results. This lets you specify complicated chains of operations without paying for work you never use.

You could work out an algorithm in the interactive Spark shell, do something else, then ask for some output. If you find that it takes a long time, you can put it into a script to be run in batch mode.

This is in contrast to eager evaluation, where everything’s evaluated immediately.
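To make the difference concrete, here is a minimal sketch of lazy evaluation in plain Python. It is not Spark's API: `LazyDataset`, `square`, and `calls` are hypothetical names invented for this illustration. The only point is that the recorded steps do nothing until an action asks for output.

```python
class LazyDataset:
    """Toy stand-in for an RDD (not Spark's API): records steps, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source   # any iterable of input records
        self._steps = steps     # tuple of (kind, function) pairs, in order

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also just recorded, nothing is executed here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded pipeline actually executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every time the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # still 0: building the pipeline ran nothing
result = pipeline.collect()        # the action triggers the work
calls_after_action = len(calls)    # now 5: one call per input record
```

An eager version would have called `square` five times as soon as `map` was invoked, whether or not anyone ever asked for the result.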

Transformations vs. Actions

Transformations derive new data from existing data, for example by sorting, mapping, or filtering items that meet certain criteria. Each transformation simply returns a new RDD from the old one, nondestructively. Because work on the data is a chain of transformations, an RDD can be rebuilt after a failure by replaying that chain. Transformations include filtering, mapping, and reducing by key, as well as operations derived from set theory, such as unions and intersections.

The Spark documentation shows all of the transformations you can make to your data. You can build all kinds of complex transformations to explore your data out of these simple building blocks.

Actions are operations that produce some kind of output. Actions include counting the elements in an RDD, collecting them into an array, and retrieving the first N items. You can also save an RDD as a text file for export.
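The split between the two kinds of operation can be sketched with a small toy class. This is plain Python, not Spark itself: `ToyRDD` and its `of` helper are hypothetical, although the method names are chosen to echo RDD operations of the same names. Transformations hand back another `ToyRDD` and leave the original untouched, while actions hand back ordinary Python values.

```python
class ToyRDD:
    """Toy model (not Spark's API) of the transformation/action split."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the records

    @classmethod
    def of(cls, records):
        frozen = tuple(records)   # the starting data is never modified
        return cls(lambda: list(frozen))

    # Transformations: each returns a NEW ToyRDD and runs nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def union(self, other):
        # Keeps duplicates, like concatenating the two datasets.
        return ToyRDD(lambda: self._compute() + other._compute())

    def intersection(self, other):
        # Values present in both datasets, without duplicates.
        # Sorted only so this toy gives a predictable order.
        return ToyRDD(lambda: sorted(set(self._compute()) & set(other._compute())))

    # Actions: each runs the chain and returns a plain Python value.
    def count(self):
        return len(self._compute())

    def take(self, n):
        return self._compute()[:n]

    def collect(self):
        return self._compute()


orders = ToyRDD.of([5, 12, 7, 30, 12])
large = orders.filter(lambda amount: amount > 6)     # transformation
doubled = large.map(lambda amount: amount * 2)       # transformation
both = large.intersection(ToyRDD.of([7, 12, 99]))    # set-style transformation

original_count = orders.count()   # action: the original still has 5 records
first_two = doubled.take(2)       # action: first N items
shared = both.collect()           # action: everything, as a list
```

Note that `orders` is never changed by any of the calls above; every transformation builds on it without destroying it.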

Lineage

A lot of the resiliency in RDDs comes from lineage. As mentioned earlier, transformations are nondestructive, preserving the earlier RDD. For each RDD, Spark keeps a lineage: a record of the chain of transformations that produced it from earlier RDDs. If a node fails, the master can use the lineage to reconstruct the lost data on other nodes.

Of course, that assumes that nothing bad has happened to the driver application, whether it is running in client or cluster mode. In any case, the lineage feature is very powerful and allows a great degree of safety, but you should still follow good practices such as keeping up-to-date backups and having an off-site backup plan for data that really matters. If you can afford it, you might want to build clusters in another area for continuity. These days, it’s also easy to use cloud providers to have clusters in different places without having to wire up servers yourself.
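The lineage idea described in this section can be sketched in a few lines of plain Python. This is a simplified illustration and not how Spark is implemented: `LineageRDD`, `transform`, and the step names are hypothetical. Each dataset remembers its parent and the function that derived it, so computed data that goes missing can be rebuilt by replaying the chain.

```python
class LineageRDD:
    """Toy dataset (not Spark's API) that remembers how it was derived."""

    def __init__(self, parent=None, fn=None, source=None, name="source"):
        self.parent = parent    # the dataset this one was derived from
        self.fn = fn            # function applied to the parent's records
        self.source = source    # only set on the root dataset
        self.name = name
        self.cached = None      # computed records; may be "lost"

    def transform(self, name, fn):
        # Nondestructive: the parent is kept and a new dataset is returned.
        return LineageRDD(parent=self, fn=fn, name=name)

    def lineage(self):
        # Walk back to the root, then report the steps oldest-first.
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute(self):
        # Recompute from the parent only when nothing is cached.
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.fn(self.parent.compute())
        return self.cached


raw = LineageRDD(source=["3", "8", "x", "10"])
digits = raw.transform("keep_digits", lambda rows: [r for r in rows if r.isdigit()])
numbers = digits.transform("to_int", lambda rows: [int(r) for r in rows])

first_result = numbers.compute()

# Simulate losing the machine that held the computed records.
numbers.cached = None
digits.cached = None

recovered = numbers.compute()   # rebuilt by replaying the recorded steps
steps = numbers.lineage()
```

The sketch also shows the limit of the approach: recovery works only because the root data in `raw` is still available, which is the role the article assigns to reliable storage such as HDFS.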

Conclusion

The heart of a good program is, as any really talented programmer will tell you, the data structures, rather than the algorithm. RDDs provide a good data structure for Big Data because of their reliability combined with their simplicity. With Spark and RDDs, the only limit will be your imagination, especially when backed by the support of MapR’s Spark distribution.

Take a deeper dive into Spark with the free interactive eBook: Getting Started with Spark: From Inception to Production, by James A. Scott.

