Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    data analytics
    How Data Analytics Can Help You Construct A Financial Weather Map
    4 Min Read
    financial analytics
    Financial Analytics Shows The Hidden Cost Of Not Switching Systems
    4 Min Read
    warehouse accidents
    Data Analytics and the Future of Warehouse Safety
    10 Min Read
    stock investing and data analytics
    How Data Analytics Supports Smarter Stock Trading Strategies
    4 Min Read
    predictive analytics risk management
    How Predictive Analytics Is Redefining Risk Management Across Industries
    7 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: Big Data Sets You Can Use with R
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Big Data Sets You Can Use with R
Big DataSoftware

Big Data Sets You Can Use with R

DavidMSmith
DavidMSmith
9 Min Read
Image
SHARE

ImageThe world may indeed be awash with data, however, it is not always easy to find a suitable data set when you need one.

ImageThe world may indeed be awash with data, however, it is not always easy to find a suitable data set when you need one. As the number of people becoming involved with R and data science increases so does the need for interesting data sets for creating examples, showcasing machine learning algorithms and developing statistical analyses. The most difficult data sets to find are those that would provide the foundation for impressive big data examples: data sets with a 100 million rows and hundreds of variables.The problem with big data, however, is that most of it is proprietary and locked away. Consequently, when constructing examples it is often necessary “make do” with data sets that are considerably smaller than an analyst is likely to be faced with in practice. To help with this problem, we have added some new data sets to lists of data sets on inside-r.org that we began keeping since almost two years ago. So, if you are looking for a sample data set or if you are the kind of person who enjoys browsing data repositories as some people enjoy browsing bookstores have a look at what is available there. The following presents some of the highlights.

The Revolution Analytics collection contains some of the data sets we use at Revolution to show off the Parallel External Memory Algorithms in our RevoScaleR package. The collection includes easily accessible “tarred-up” versions of the Airlines Data Set, Census5PCT2000 data set and an artificial set of mortgage default data.

The Airlines data set that was used in the 2009 American Statistical Association challenge has become the “iris” data set for big data. This file contains information on US Domestic Flights between 1987 and 2008 and has some nice properties that make it useful for different kinds of analyses. It has over 123 million rows (observations) and 29 columns containing variables of different data types including factors with lots of levels. The following output from the RevoScaleR function rxGetInfo() displays basic information for the variables in the file.  

More Read

big data in the gig economy
Big Data Creates Numerous New Perks in the Gig Economy
The New Mainstream Appeal of Apache Spark
Why Investing in Data Is Crucial for Business Growth In 2022
6 Tremendous Benefits of Big Data for Financial Management
Does Data Analytics Help With Selecting An IT Provider?
> rxGetInfoXdf(working.file,getVarInfo=TRUE) File name: C:\DATA\Airlines_87_08\BigAir3.xdf Number of observations: 123534969 Number of variables: 31 Number of blocks: 833 Variable information: Var 1: Year, Type: integer, Low/High: (1987, 2008) Var 2: Month 12 factor levels: January February March April May ... August September October November December Var 3: DayofMonth, Type: integer, Low/High: (1, 31) Var 4: DayOfWeek 7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday Var 5: DepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.5000) Var 6: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000) Var 7: ArrTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.9167) Var 8: CRSArrTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000) Var 9: UniqueCarrier 29 factor levels: 9E AA AQ AS B6 ... UA US WN XE YV Var 10: FlightNum 8160 factor levels: 1 10 100 1000 1001 ... 995 996 997 998 999 Var 11: TailNum, Type: numeric, Storage: float32, Low/High: (0.0000, 715.0000) Var 12: ActualElapsedTime, Type: integer, Low/High: (-719, 1883) Var 13: CRSElapsedTime, Type: integer, Low/High: (-1240, 1613) Var 14: AirTime, Type: integer, Low/High: (-3818, 3508) Var 15: ArrDelay, Type: integer, Low/High: (-1437, 2598) Var 16: DepDelay, Type: integer, Low/High: (-1410, 2601) Var 17: Origin 347 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM Var 18: Dest 352 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM Var 19: Distance, Type: integer, Low/High: (0, 4983) Var 20: TaxiIn, Type: integer, Low/High: (0, 1523) Var 21: TaxiOut, Type: integer, Low/High: (0, 3905) Var 22: Cancelled, Type: logical, Low/High: (0, 1) Var 23: CancellationCode 5 factor levels: NA carrier weather NAS security Var 24: Diverted, Type: logical, Low/High: (0, 1) Var 25: CarrierDelay, Type: integer, Low/High: (0, 2580) Var 26: WeatherDelay, Type: integer, Low/High: (0, 1510) Var 27: NASDelay, Type: integer, Low/High: (-60, 1392) Var 28: SecurityDelay, Type: integer, Low/High: (0, 533) Var 29: LateAircraftDelay, Type: integer, Low/High: (0, 1407)

Created by Pretty R at inside-R.org

Note that the 22 .csv files that comprise the Airlines dataset are available on RITA, the FAA website, along with data for more recent time periods

A smaller, but still very useful file for machine learning applications, containing medicare data was used in an R-bloggers post highlighting bigglm and ffbase. This file contains almost 3 million rows and eleven variables.

Graham Williams and others (me included) have made good use of the small version of the Australian weather file in his rattle R package. However, in an appendix of his book Data Mining with Rattle and R, Grahm points the way to the Australian government site which makes the data available in what Hadley Wickham might call a “tidy” format. (The data are not “clean” but they are in good enough shape to work with.) The following chart was built with rattle from Canberra Data collected between March and July of this year. Code to access and clean the file a bit, based on code Graham provides in his book, is available here:  Download Code to clean weather data.

Weather_corr_plot

The moderately large airline “Edge” data set (3.5 million records) along with the airports and their locations data set, both available without charge from infochimps provided the occasion for a slightly more elaborate data shaping and cleaning effort using RevoScaleR functions. One way to do this is documented in the RevoScaleR Data Step White Paper.

As a final example of R friendly datasets have a look at those that Max Kuhn and Kjell Johnson have wrapped into the R package, AppliedPredictiveModeling, which they wrote to support their Springer book of the same name. This package offers a number of interesting small datasets including segmentationOriginal, which provides measurements on cell body features and has over 2,000 observations and 100 variables.

The data sets on the Inside-R list cover quite a bit of ground, however, I am sure that there is much more out there that should be on the list. We at Revolution Analytics would very much appreciate learning about what we have missed.  Many thanks to everyone who has provided data sets or contributed to the Inside-R list in some other way.

by Joseph Rickert

TAGGED:r
Share This Article
Facebook Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

data analytics
How Data Analytics Can Help You Construct A Financial Weather Map
Analytics Exclusive Infographic
AI use in payment methods
AI Shows How Payment Delays Disrupt Your Business
Artificial Intelligence Exclusive Infographic
financial analytics
Financial Analytics Shows The Hidden Cost Of Not Switching Systems
Analytics Exclusive Infographic
multi model ai
How Teams Using Multi-Model AI Reduced Risk Without Slowing Innovation
Artificial Intelligence Exclusive

Stay Connected

1.2KFollowersLike
33.7KFollowersFollow
222FollowersPin

You Might also Like

Learning R

8 Min Read

Tabled: Is R the Solution?

12 Min Read

Interactive stock visualizations with R

3 Min Read

Open Source Analytics Reaches Main Street (and Some Other Trends in Analytics)

8 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

data-driven web design
5 Great Tips for Using Data Analytics for Website UX
Big Data
ai is improving the safety of cars
From Bolts to Bots: How AI Is Fortifying the Automotive Industry
Artificial Intelligence

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?