Big Data Sets You Can Use with R

DavidMSmith

The world may indeed be awash with data; however, it is not always easy to find a suitable data set when you need one. As the number of people becoming involved with R and data science increases, so does the need for interesting data sets for creating examples, showcasing machine learning algorithms and developing statistical analyses. The most difficult data sets to find are those that would provide the foundation for impressive big data examples: data sets with 100 million rows and hundreds of variables.

The problem with big data, however, is that most of it is proprietary and locked away. Consequently, when constructing examples it is often necessary to “make do” with data sets that are considerably smaller than those an analyst is likely to face in practice. To help with this problem, we have added some new data sets to the lists of data sets on inside-r.org that we began keeping almost two years ago. So, if you are looking for a sample data set, or if you are the kind of person who enjoys browsing data repositories as some people enjoy browsing bookstores, have a look at what is available there. The following presents some of the highlights.

The Revolution Analytics collection contains some of the data sets we use at Revolution to show off the Parallel External Memory Algorithms in our RevoScaleR package. The collection includes easily accessible “tarred-up” versions of the Airlines Data Set, Census5PCT2000 data set and an artificial set of mortgage default data.

The Airlines data set that was used in the 2009 American Statistical Association challenge has become the “iris” data set for big data. This file contains information on US domestic flights between 1987 and 2008 and has some nice properties that make it useful for different kinds of analyses. It has over 123 million rows (observations) and 29 columns containing variables of different data types, including factors with many levels. The following output from the RevoScaleR function rxGetInfo() displays basic information for the variables in the file.

> rxGetInfoXdf(working.file,getVarInfo=TRUE)
File name: C:\DATA\Airlines_87_08\BigAir3.xdf
Number of observations: 123534969
Number of variables: 31
Number of blocks: 833
Variable information:
Var 1: Year, Type: integer, Low/High: (1987, 2008)
Var 2: Month
       12 factor levels: January February March April May ... August September October November December
Var 3: DayofMonth, Type: integer, Low/High: (1, 31)
Var 4: DayOfWeek
       7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Var 5: DepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.5000)
Var 6: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
Var 7: ArrTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.9167)
Var 8: CRSArrTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
Var 9: UniqueCarrier
       29 factor levels: 9E AA AQ AS B6 ... UA US WN XE YV
Var 10: FlightNum
       8160 factor levels: 1 10 100 1000 1001 ... 995 996 997 998 999
Var 11: TailNum, Type: numeric, Storage: float32, Low/High: (0.0000, 715.0000)
Var 12: ActualElapsedTime, Type: integer, Low/High: (-719, 1883)
Var 13: CRSElapsedTime, Type: integer, Low/High: (-1240, 1613)
Var 14: AirTime, Type: integer, Low/High: (-3818, 3508)
Var 15: ArrDelay, Type: integer, Low/High: (-1437, 2598)
Var 16: DepDelay, Type: integer, Low/High: (-1410, 2601)
Var 17: Origin
       347 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM
Var 18: Dest
       352 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM
Var 19: Distance, Type: integer, Low/High: (0, 4983)
Var 20: TaxiIn, Type: integer, Low/High: (0, 1523)
Var 21: TaxiOut, Type: integer, Low/High: (0, 3905)
Var 22: Cancelled, Type: logical, Low/High: (0, 1)
Var 23: CancellationCode
       5 factor levels: NA carrier weather NAS security
Var 24: Diverted, Type: logical, Low/High: (0, 1)
Var 25: CarrierDelay, Type: integer, Low/High: (0, 2580)
Var 26: WeatherDelay, Type: integer, Low/High: (0, 1510)
Var 27: NASDelay, Type: integer, Low/High: (-60, 1392)
Var 28: SecurityDelay, Type: integer, Low/High: (0, 533)
Var 29: LateAircraftDelay, Type: integer, Low/High: (0, 1407)
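
For readers without RevoScaleR, a similar per-variable overview can be approximated in base R for a data frame that fits in memory. The sketch below is only an illustration: describe_vars() is a hypothetical helper (not a RevoScaleR function), and the three-row toy data frame is made up, borrowing only a few column names from the listing above.

```r
# Per-variable summary loosely modeled on the listing above.
# describe_vars() is a hypothetical helper, not part of RevoScaleR.
describe_vars <- function(df) {
  out <- character(ncol(df))
  for (i in seq_along(df)) {
    v <- df[[i]]
    name <- names(df)[i]
    out[i] <- if (is.factor(v)) {
      # Factors: report the number of levels and the first few level names
      sprintf("Var %d: %s %d factor levels: %s", i, name, nlevels(v),
              paste(head(levels(v), 5), collapse = " "))
    } else {
      # Other columns: report the storage type and the observed range
      rng <- range(v, na.rm = TRUE)
      sprintf("Var %d: %s, Type: %s, Low/High: (%s, %s)", i, name,
              typeof(v), format(rng[1]), format(rng[2]))
    }
  }
  out
}

# Toy data (made up for illustration, not taken from the airline file)
toy <- data.frame(
  Year = c(1987L, 1999L, 2008L),
  DayOfWeek = factor(c("Monday", "Friday", "Monday"),
                     levels = c("Monday", "Friday")),
  ArrDelay = c(-4L, 12L, 30L)
)
info <- describe_vars(toy)
writeLines(info)
```

Unlike the RevoScaleR function, this helper needs the whole data frame in memory, so it is only suitable for samples or smaller files.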


Note that the 22 .csv files that make up the Airlines data set are available from RITA, the Research and Innovative Technology Administration of the U.S. Department of Transportation, along with data for more recent time periods.
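
Because the raw data come as large .csv files, it can help to summarize a column without loading a whole file at once. The following is a minimal base-R sketch of that idea, not the RevoScaleR approach: chunked_mean() is a hypothetical helper, and the file it reads is a tiny synthetic stand-in written to a temporary path, using only column names that appear in the variable listing.

```r
# Mean of one column, computed chunk by chunk from a CSV file (base R only).
# chunked_mean() is a hypothetical helper written for this illustration.
chunked_mean <- function(path, column, chunk_rows = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once; scan() strips any quotes around the names
  header <- scan(text = readLines(con, n = 1L), what = character(),
                 sep = ",", quiet = TRUE)
  total <- 0
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    values <- chunk[[column]]
    values <- values[!is.na(values)]        # skip missing delays
    total <- total + sum(values)
    n <- n + length(values)
  }
  total / n
}

# Synthetic stand-in data (NOT real airline records)
flights <- data.frame(
  Year = rep(1987:1988, each = 5),
  UniqueCarrier = rep(c("AA", "UA"), 5),
  ArrDelay = c(5, -3, NA, 12, 0, 7, 20, -1, NA, 4)
)
path <- tempfile(fileext = ".csv")
write.csv(flights, path, row.names = FALSE)

result <- chunked_mean(path, "ArrDelay", chunk_rows = 3L)
print(result)
```

Only running totals are kept between chunks, so memory use depends on chunk_rows rather than on the size of the file.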

A smaller but still very useful file for machine learning applications, containing Medicare data, was used in an R-bloggers post highlighting bigglm and ffbase. This file contains almost 3 million rows and eleven variables.
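
The idea behind fitting a regression to data that arrive in pieces can be sketched in base R by accumulating cross-products over chunks of rows. This is an illustration under stated assumptions, not the method used in the post mentioned above: chunked_lm() is a hypothetical helper that solves the normal equations, which is simpler than (and not the same as) what the biglm package does internally, and the data are simulated with made-up column names.

```r
# Least-squares fit accumulated over chunks of rows (base R only).
# chunked_lm() is a hypothetical helper; it assumes numeric predictors so
# that every chunk produces the same model matrix columns.
chunked_lm <- function(chunks, formula) {
  xtx <- NULL
  xty <- NULL
  for (chunk in chunks) {
    mf <- model.frame(formula, chunk)
    x <- model.matrix(formula, mf)
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(x)        # X'X for the first chunk
      xty <- crossprod(x, y)     # X'y for the first chunk
    } else {
      xtx <- xtx + crossprod(x)  # add this chunk's contribution
      xty <- xty + crossprod(x, y)
    }
  }
  drop(solve(xtx, xty))          # solve the normal equations
}

# Simulated data (made up for illustration, not the Medicare file)
set.seed(42)
claims <- data.frame(age = runif(300, 65, 95), visits = rpois(300, 4))
claims$cost <- 100 + 12 * claims$age + 50 * claims$visits +
  rnorm(300, sd = 20)

chunks <- split(claims, rep(1:3, each = 100))
coef_chunked <- chunked_lm(chunks, cost ~ age + visits)
coef_full <- coef(lm(cost ~ age + visits, data = claims))
print(rbind(chunked = coef_chunked, full = coef_full))
```

The chunked coefficients should agree with an ordinary lm() fit on the full data up to numerical error, since the sums of cross-products are the same either way.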

Graham Williams and others (me included) have made good use of the small version of the Australian weather file included in his rattle R package. However, in an appendix of his book Data Mining with Rattle and R, Graham points the way to the Australian government site, which makes the data available in what Hadley Wickham might call a “tidy” format. (The data are not “clean”, but they are in good enough shape to work with.) The following chart was built with rattle from Canberra data collected between March and July of this year. Code to access and clean the file a bit, based on code Graham provides in his book, is available here: Download Code to clean weather data.

[Figure: Weather_corr_plot]

The moderately large airline “Edge” data set (3.5 million records), along with the airports and their locations data set, both available without charge from infochimps, provided the occasion for a slightly more elaborate data shaping and cleaning effort using RevoScaleR functions. One way to do this is documented in the RevoScaleR Data Step White Paper.

As a final example of R-friendly data sets, have a look at those that Max Kuhn and Kjell Johnson have wrapped into the R package AppliedPredictiveModeling, which they wrote to support their Springer book of the same name. This package offers a number of interesting small data sets, including segmentationOriginal, which provides measurements on cell body features and has over 2,000 observations and 100 variables.

The data sets on the Inside-R list cover quite a bit of ground; however, I am sure that there is much more out there that should be on the list. We at Revolution Analytics would very much appreciate learning about what we have missed. Many thanks to everyone who has provided data sets or contributed to the Inside-R list in some other way.

by Joseph Rickert
