SmartData Collective

ggplot2 for Big Data

DavidMSmith
Last updated: 2011/10/21 at 4:20 PM

(Hadley Wickham, author of ggplot2 and several other R packages, guest blogs today about forthcoming big-data improvements to his R graphics package — ed.) 

Hi! I’m Hadley Wickham and I’m guest posting on the Revolutions blog to give you a taste of some of the visualisation work that my research team and I worked on this summer. This work has been generously funded by Revolution Analytics and while, as you’ll see, it works particularly well with RevoScaleR, it’s also contributing to changes that will help all ggplot2 users.

This summer three undergrads (James Rigby, Jonathan Stewart and Hyun Bin Kang) and one grad student (Ben White) have been working on the answer to an important question: how can we make a scatterplot work when you have hundreds of millions of points? Scatterplots are one of the most important tools of exploratory data analysis, but they start to break down even with relatively small datasets because of overplotting: you can’t tell how many points are plotted at each location. They also get slower and slower the more points you try to draw.

The answer to both these problems is relatively simple: instead of plotting the raw data, plot densities, or distributions. These can be generated simply (by binning and counting the number of points in each bin), or with more sophistication (by smoothing the bin counts to get a kernel density estimate). RevoScaleR makes this process incredibly fast: you can bin tens of millions of observations in a few seconds on commodity hardware, and a kernel density estimate takes only a fraction more time to compute.
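To make the bin-then-smooth idea concrete, here is a minimal sketch in base R. It is not the RevoScaleR implementation: the simulated data, the 50 x 50 grid, and the Gaussian bandwidth are all assumptions chosen for illustration, and the smoothing step is a plain separable Gaussian filter applied to the bin counts.

```r
# Simulated stand-in for a large data set (hypothetical, not from the post).
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# 1. Bin: lay a 50 x 50 grid over the data and count the points in each cell.
nbins <- 50
x_breaks <- seq(min(x), max(x), length.out = nbins + 1)
y_breaks <- seq(min(y), max(y), length.out = nbins + 1)
x_bin <- findInterval(x, x_breaks, rightmost.closed = TRUE)
y_bin <- findInterval(y, y_breaks, rightmost.closed = TRUE)
counts <- matrix(tabulate((y_bin - 1) * nbins + x_bin, nbins = nbins^2),
                 nrow = nbins)  # rows index x bins, columns index y bins

# 2. Smooth: blur the counts with a Gaussian kernel along each axis.
bw <- 1.5  # bandwidth in units of bin widths (an arbitrary choice)
idx <- seq_len(nbins)
K <- outer(idx, idx, function(i, j) dnorm(i - j, sd = bw))
K <- K / rowSums(K)
smoothed <- K %*% counts %*% t(K)

# 3. Normalise so the estimate integrates to 1 over the grid.
cell_area <- diff(x_breaks)[1] * diff(y_breaks)[1]
density_est <- smoothed / (sum(smoothed) * cell_area)
```

The point of the sketch is that everything after step 1 works on a 50 x 50 matrix, so the cost of smoothing and plotting no longer depends on the number of observations.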

Once you have the density, what can you do with it? The following plots show two of the ideas that we came up with. The examples use the diamonds data set from ggplot2, but the beauty of these techniques is that they’ll work regardless of how much data you have: the extra complexity is taken care of in the density computation stage.


A scatterplot of depth and table coloured by the z dimension is uninformative because of the extreme amount of overplotting:

[Figure: Continuous-scatter]

One way to make it better is to bin and smooth in 3d and then plot the conditional distribution of z at multiple values of depth and table.

[Figure: Continuous-density]

The shape shows the distribution of z, and the colour shows the total density at that location — higher values mean there are more data points in that location. This plot reveals much more than the previous one: most of the data points are concentrated near depth 56 and table 60 where the distribution of z is skewed towards smaller values. As depth increases, the average value of z also seems to increase.
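A rough sketch of the underlying computation, using only the diamonds data from ggplot2 and base R, might look like the following. It simplifies the approach described above: depth and table are binned into unit-width cells by rounding, and z is smoothed in one dimension within each cell rather than in full 3d. The cell width and the 500-point cutoff are assumptions made for illustration.

```r
library(ggplot2)  # used here only for the diamonds data set

d <- as.data.frame(diamonds[, c("depth", "table", "z")])

# Assign each diamond to a (depth, table) cell of width 1 (assumed width).
d$cell <- interaction(round(d$depth), round(d$table), drop = TRUE)

# Total count per cell: this is the quantity mapped to colour in the plot.
cell_n <- table(d$cell)

# Keep only well-populated cells (the cutoff of 500 is arbitrary).
big <- names(cell_n)[cell_n >= 500]

# Conditional distribution of z within each kept cell, as a 1-d kernel
# density estimate with R's default bandwidth.
cond <- lapply(big, function(k) density(d$z[d$cell == k]))
names(cond) <- big
```

Each element of `cond` holds the `x` and `y` vectors of one small density curve, and `cell_n[big]` gives the matching totals, which together are the two ingredients the plot combines.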

The second example is a scatterplot of carat vs. price, coloured by the colour of the diamond:


[Figure: Discrete-scatter]

There is some hint that J colours are relatively cheaper (for a given size they have lower prices) but it’s hard to see anything else because of the overplotting. Binning the data and displaying the distribution of colour in each bin makes the important patterns much easier to see.

[Figure: Discrete-density]

Colours D, E and F are on the more expensive side, and H, I and particularly J are cheaper. Only bins containing more than 100 points are included to avoid drawing the eye to regions with little data.
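The binning behind this kind of display can be sketched as follows. The 20 x 20 grid is an assumption, and the faceted tile plot at the end is a simple stand-in for the display described above, not a reproduction of it; only the more-than-100-points filter is taken from the text.

```r
library(ggplot2)  # provides the diamonds data set and the plotting functions

d <- as.data.frame(diamonds)

# Bin carat and price into a 20 x 20 grid (grid size is an assumption).
d$carat_bin <- cut(d$carat, breaks = 20)
d$price_bin <- cut(d$price, breaks = 20)

# Count the diamonds of each colour in each (carat, price) bin.
counts <- as.data.frame(
  table(carat_bin = d$carat_bin, price_bin = d$price_bin, color = d$color),
  responseName = "n"
)

# Total points per bin, then keep only bins with more than 100 points.
counts$total <- ave(counts$n, counts$carat_bin, counts$price_bin, FUN = sum)
dense <- counts[counts$total > 100, ]

# Share of each colour within its bin.
dense$prop <- dense$n / dense$total

# One possible display: a tile per bin, one panel per colour.
p <- ggplot(dense, aes(carat_bin, price_bin, fill = prop)) +
  geom_tile() +
  facet_wrap(~ color)
```

Printing `p` draws the panels. Because `dense` has at most one row per bin and colour, the plot stays cheap to draw however many diamonds went into the counts.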

I’m currently working with another student, Yue Hu, to turn our research into a robust R package.
