Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    data analytics
    How Data Analytics Can Help You Construct A Financial Weather Map
    4 Min Read
    financial analytics
    Financial Analytics Shows The Hidden Cost Of Not Switching Systems
    4 Min Read
    warehouse accidents
    Data Analytics and the Future of Warehouse Safety
    10 Min Read
    stock investing and data analytics
    How Data Analytics Supports Smarter Stock Trading Strategies
    4 Min Read
    predictive analytics risk management
    How Predictive Analytics Is Redefining Risk Management Across Industries
    7 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: A Closer Look at RDDs
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Warehousing > A Closer Look at RDDs
Data Warehousing

A Closer Look at RDDs

kingmesal
kingmesal
6 Min Read
Image
SHARE

ImageApache Spark has gotten a lot of attention for its fast processing of large amounts of data. But how does it get up to speed? The biggest reason that Spark is so fast is its use of the Resilient Distributed Dataset, or RDD. 

ImageApache Spark has gotten a lot of attention for its fast processing of large amounts of data. But how does it get up to speed? The biggest reason that Spark is so fast is its use of the Resilient Distributed Dataset, or RDD. 

These RDDs are the heart of Spark—the engine that allows it to process so much data so quickly and reliably.

What Are RDDs?

More Read

moving to the cloud
How Will The Cloud Impact Data Warehousing Technologies?
Double Dutch – Greenplum and Teradata
Reference Domains Part III: Collecting Classifications
Spotlight on Innovation
Five Things You Must Know About Data Warehouse Automation

RDDs are a data structure that’s distinctive of Apache Spark. The “resilient” part comes from the fact that Spark ensures a degree of safety when dealing with data. Operations on RDDs are split into two categories: transformations and actions. (This will be explained later in the article.) Operations on data are made out of chained transformations, which makes it easy to recover from machine loss.

Since MapR made its name as a company supporting Hadoop, it’s not surprising that distributed filesystems are a major focus for Spark. RDDs can be distributed across many nodes, being controlled by a “master” node. If a “slave” machine goes down, the data can be reconstructed from other nodes, like a giant RAID system. Hadoop comes in by providing the reliable, distributed HDFS system to store the data.

RDDs also hold as much data as possible in memory. With powerful nodes, most clusters shouldn’t have any problem.

Lazy Evaluation

You might think that having all of this data in memory might be as slow as molasses, but Spark borrows a technique from functional programming languages like Haskell. RDD operations are lazily evaluated. That means that the operations aren’t actually executed until they’re needed. This lets you specify some complicated operations without suffering the performance hits.

You could work out an algorithm in the interactive Spark shell, do something else, then ask for some output. If you find that it takes a long time, you can put it into a script to be run in batch mode.

This is in contrast to eager evaluation, where everything’s evaluated immediately.

Transformations vs. Actions

Transformations make changes to data, such as sorting, counting, or filtering items that meet certain criteria. These transformations simply return a new RDD from the old one nondestructively. In case of a failure, RDDs can be rebuilt from other transformations, as operations on data are made of chained transformations. Transformations include filtering, mapping, and reducing, as well as operations derived from set theory, such as unions and intersections. 

The Spark documentation shows all of the transformations you can make to your data. You can build all kinds of complex transformations to explore your data out of these simple building blocks.

Actions are operations that produce some kind of output. Actions include counting elements in an RDD, flattening them into arrays, and retrieving the first N items from an RDD. You can also save RDDs in a text file for exporting.

Lineage

A lot of the resiliency in RDDs comes from lineages. As mentioned earlier, transformations are nondestructive, preserving the earlier RDD. Spark relies on lineages, which are records of all the previous versions of RDDs. Since all transformations are made out of other transformations, if a node fails, the master can reconstruct the data out of the lineage on other nodes.

Of course, that assumes that nothing bad has happened to driver application and whether it is running in client or cluster mode. In any case, the lineage feature is very powerful and allows a great degree of safety, but you should focus on good practices such as keeping up-to-date backups and having an off-site backup plan for data that really matters. If you can afford it, you might want to build clusters in another area for continuity. These days, it’s also easy to use cloud providers to have clusters in different places without having to wire up servers yourself.

Conclusion

The heart of a good program is, as any really talented programmer will tell you, the data structures, rather than the algorithm. RDDs provide a good data structure for Big Data because of their reliability combined with their simplicity. With Spark and RDDs, the only limit will be your imagination, especially when backed by the support of MapR’s Spark distribution.

Take a deeper dive into Spark with the free interactive eBook: Getting Started with Spark: From Inception to Production, by James A. Scott.


Share This Article
Facebook Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

data analytics
How Data Analytics Can Help You Construct A Financial Weather Map
Analytics Exclusive Infographic
AI use in payment methods
AI Shows How Payment Delays Disrupt Your Business
Artificial Intelligence Exclusive Infographic
financial analytics
Financial Analytics Shows The Hidden Cost Of Not Switching Systems
Analytics Exclusive Infographic
multi model ai
How Teams Using Multi-Model AI Reduced Risk Without Slowing Innovation
Artificial Intelligence Exclusive

Stay Connected

1.2KFollowersLike
33.7KFollowersFollow
222FollowersPin

You Might also Like

HP snaps up Vertica

4 Min Read

Business Analytics and Optimization for the Intelligent…

3 Min Read

On-Demand (or SaaS) Index: Trade or Trend?

7 Min Read

Analytics: Math, Operations Research, Statistics Driving…

1 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

data-driven web design
5 Great Tips for Using Data Analytics for Website UX
Big Data
ai in ecommerce
Artificial Intelligence for eCommerce: A Closer Look
Artificial Intelligence

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?