Big Data | Data Mining | Hadoop | MapReduce | Unstructured Data

A Guide to Spark Streaming – Code Examples Included

kingmesal
Last updated: 2016/02/25 at 1:00 PM
6 Min Read

Apache Spark is great for processing large amounts of data over large clusters, but wouldn’t it be great if you could process data in near real time? You can with Spark Streaming.

Contents
  • What Is Spark Streaming?
  • What kinds of data can you analyze?
  • DStream and RDDs
  • Spark Streaming Transformations
  • Conclusion


What Is Spark Streaming?


Spark Streaming is a special streaming context (the StreamingContext), built on top of the standard SparkContext, which is geared toward batch operations. You can use it to process data in near real time. Spark Streaming uses a little trick to create small batch windows (micro-batches) that offer all of the advantages of Spark: safe, fast data handling and lazy evaluation, combined with real-time processing. It's effectively a combination of batch and stream processing.


You can shrink the batch window down to half a second to reduce processing latency, though smaller windows are more memory intensive. Spark Streaming is used for everything from credit card fraud detection to identifying threats on the Internet.
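
For instance, assuming you already have a SparkContext named sc (as in the examples later in this post), a half-second batch window looks like this; Milliseconds, like Seconds, is one of Spark Streaming's duration helpers:

import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// A StreamingContext with a 500 ms micro-batch window.
val ssc = new StreamingContext(sc, Milliseconds(500))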

What kinds of data can you analyze?

You might be wondering what kind of data you can ingest into Apache Spark. The short answer is pretty much everything.

More specifically, you can ingest data from Twitter, Flume, Kafka, ZeroMQ, a custom feed, or HDFS. You can also export to HDFS, as well as to other databases, applications, and dashboards.
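
As a minimal sketch of the export side, a DStream can be written out as text files to HDFS or any Hadoop-compatible path. The path below is just a placeholder, and lines is a DStream like the one we create later in this post:

// Each batch is written out as a set of text files named by timestamp.
lines.saveAsTextFiles("hdfs://namenode:8020/streams/lines", "txt")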

DStream and RDDs

How is all this possible? It all comes down to the primary data type of Spark: the RDD, or Resilient Distributed Dataset. An RDD is an abstraction of the data that’s held in memory, which is a lot faster than storing and fetching things from disk. This already gives you a significant speed boost over other systems.

RDDs are also safer to use because transformations never modify data in place: each transformation records its lineage and returns a new RDD with the change applied. This lets Spark reconstruct the data, with all the changes, should something go wrong with one of the nodes in the cluster, such as a power failure.
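
To make that concrete, here's a sketch in plain Spark: a transformation returns a new RDD and leaves the original untouched, and the recorded lineage is what Spark replays if a partition is lost:

val numbers = sc.parallelize(1 to 5)   // the original RDD
val doubled = numbers.map(_ * 2)       // a new RDD; numbers is unchanged
println(doubled.toDebugString)         // shows the lineage of transformations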

DStream takes the concept of RDDs and applies it to streams. A DStream is simply a stream of RDDs, giving you all of the advantages of speed and safety in near real time. The DStream API offers a more limited set of transformations than the standard Spark RDD API.
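
You can see the "stream of RDDs" idea directly in the API: foreachRDD hands you each micro-batch as an ordinary RDD, so anything you can do with the RDD API you can do per batch. A quick sketch, assuming a DStream named lines like the one created below:

lines.foreachRDD { rdd =>
  // Each micro-batch arrives as a plain RDD.
  println(s"Batch size: ${rdd.count()}")
}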

Spark Streaming Transformations

If you’re wondering what kind of transformations you can do on DStreams, they’re pretty similar to the standard Spark transformations.

We’ll borrow some examples from the Apache Spark Reference Card to give you a taste. Let’s pretend we’re reading data over some kind of stream, such as from a social media feed. 

Let's start our StreamingContext, which wraps the existing SparkContext (sc), sets a one-second batch window, and connects to a TCP text source on port 9999:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
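
One detail the snippets below take for granted: nothing actually flows until the context is started. After wiring up your transformations, you would finish with:

ssc.start()             // begin receiving and processing data
ssc.awaitTermination()  // block until the stream is stopped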

For example, map(func) takes a function func as an argument and applies it to each element, returning a new DStream.

Here’s an example multiplying each line by 10:

lines.map(x => x.toInt * 10).print()

We'll send some data with Netcat (the nc program), available on most Unix-like systems. Spark is reading from port 9999, so we'll tell Netcat to listen there (-l listens, -k keeps the connection open).

prompt> nc -lk 9999
12
34

Here’s what the output looks like:

120
340

flatMap() is similar, but each input item can map to zero or more output items.

This example splits text, putting each word on separate lines:

lines.flatMap(_.split(" ")).print()

Let’s try it with the string “Spark is fun”:

prompt> nc -lk 9999
Spark is fun

And here’s the output:

Spark
is
fun

count() is obvious enough: it counts the number of data elements in each batch.

We can count the words in our stream (adding print() so the result is visible):

lines.flatMap(_.split(" ")).count().print()

And here are some lines:

prompt> nc -lk 9999
say
hello
to
spark

The output should be 4.

reduce() is similar, but aggregates the data elements using a function you pass as an argument, instead of just counting them.

We can use this to add up all the numbers:

lines.map(x => x.toInt).reduce(_ + _).print()

And let’s get some numbers into Spark Streaming:

prompt> nc -lk 9999
1
3
5
7

The answer should be 16.

countByValue() counts the number of occurrences of each distinct element.

We can use this to count the number of times each word occurs:

lines.countByValue().print()

We’ll include some duplicate lines just to show you how it works:

prompt> nc -lk 9999
spark
spark
is
fun
fun

The output will look like this:

(is,1)
(spark,2)
(fun,2)

An alternate way we could do this is by using the reduceByKey() function:

val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

Conclusion

By now, you have seen the power of Spark Streaming and what it can do for your near-real-time big data needs. If you want to experiment, you can download a sandbox: a full virtual machine with Apache Spark preinstalled that you can play around with.

TAGGED: big data