Big Data Sets You Can Use with R

DavidMSmith
Last updated: 2013/08/23 at 8:00 AM

The world may indeed be awash with data; however, it is not always easy to find a suitable data set when you need one. As the number of people becoming involved with R and data science increases, so does the need for interesting data sets for creating examples, showcasing machine learning algorithms and developing statistical analyses. The most difficult data sets to find are those that would provide the foundation for impressive big data examples: data sets with 100 million rows and hundreds of variables.

The problem with big data, however, is that most of it is proprietary and locked away. Consequently, when constructing examples it is often necessary to “make do” with data sets that are considerably smaller than those an analyst is likely to face in practice. To help with this problem, we have added some new data sets to the lists of data sets on inside-r.org that we began keeping almost two years ago. So, if you are looking for a sample data set, or if you are the kind of person who enjoys browsing data repositories as some people enjoy browsing bookstores, have a look at what is available there. The following presents some of the highlights.

The Revolution Analytics collection contains some of the data sets we use at Revolution to show off the Parallel External Memory Algorithms in our RevoScaleR package. The collection includes easily accessible “tarred-up” versions of the Airlines Data Set, Census5PCT2000 data set and an artificial set of mortgage default data.

The Airlines data set that was used in the 2009 American Statistical Association challenge has become the “iris” data set for big data. This file contains information on US domestic flights between 1987 and 2008 and has some nice properties that make it useful for different kinds of analyses. It has over 123 million rows (observations) and 29 columns containing variables of different data types, including factors with lots of levels. The following output from the RevoScaleR function rxGetInfoXdf() displays basic information for the variables in the file.

> rxGetInfoXdf(working.file,getVarInfo=TRUE)
File name: C:\DATA\Airlines_87_08\BigAir3.xdf
Number of observations: 123534969
Number of variables: 31
Number of blocks: 833
Variable information:
Var 1: Year, Type: integer, Low/High: (1987, 2008)
Var 2: Month
       12 factor levels: January February March April May ... August September October November December
Var 3: DayofMonth, Type: integer, Low/High: (1, 31)
Var 4: DayOfWeek
       7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Var 5: DepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.5000)
Var 6: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
Var 7: ArrTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.9167)
Var 8: CRSArrTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
Var 9: UniqueCarrier
       29 factor levels: 9E AA AQ AS B6 ... UA US WN XE YV
Var 10: FlightNum
       8160 factor levels: 1 10 100 1000 1001 ... 995 996 997 998 999
Var 11: TailNum, Type: numeric, Storage: float32, Low/High: (0.0000, 715.0000)
Var 12: ActualElapsedTime, Type: integer, Low/High: (-719, 1883)
Var 13: CRSElapsedTime, Type: integer, Low/High: (-1240, 1613)
Var 14: AirTime, Type: integer, Low/High: (-3818, 3508)
Var 15: ArrDelay, Type: integer, Low/High: (-1437, 2598)
Var 16: DepDelay, Type: integer, Low/High: (-1410, 2601)
Var 17: Origin
       347 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM
Var 18: Dest
       352 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM
Var 19: Distance, Type: integer, Low/High: (0, 4983)
Var 20: TaxiIn, Type: integer, Low/High: (0, 1523)
Var 21: TaxiOut, Type: integer, Low/High: (0, 3905)
Var 22: Cancelled, Type: logical, Low/High: (0, 1)
Var 23: CancellationCode
       5 factor levels: NA carrier weather NAS security
Var 24: Diverted, Type: logical, Low/High: (0, 1)
Var 25: CarrierDelay, Type: integer, Low/High: (0, 2580)
Var 26: WeatherDelay, Type: integer, Low/High: (0, 1510)
Var 27: NASDelay, Type: integer, Low/High: (-60, 1392)
Var 28: SecurityDelay, Type: integer, Low/High: (0, 533)
Var 29: LateAircraftDelay, Type: integer, Low/High: (0, 1407)

Created by Pretty R at inside-R.org

Note that the 22 .csv files that make up the Airlines data set are available from RITA, the Research and Innovative Technology Administration of the U.S. Department of Transportation, along with data for more recent time periods.
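For readers without RevoScaleR, the external-memory idea can be approximated in base R by reading a file a few rows at a time and keeping only running totals. The sketch below is an illustration under stated assumptions, not the RevoScaleR implementation: it writes a tiny synthetic file (the values are made up) that borrows the column names UniqueCarrier and ArrDelay from the variable listing, and mean_delay_by_carrier() is a hypothetical helper.

```r
# Tiny synthetic stand-in for one yearly airline .csv file (values are made up).
csv_path <- tempfile(fileext = ".csv")
flights <- data.frame(
  Year          = 2008L,
  UniqueCarrier = c("AA", "UA", "AA", "WN", "UA", "AA", "WN"),
  ArrDelay      = c(5L, -3L, NA, 20L, 12L, 0L, -7L)
)
write.csv(flights, csv_path, row.names = FALSE)

# Hypothetical helper: mean arrival delay per carrier, reading chunk_rows
# lines at a time so the whole file never has to fit in memory.
mean_delay_by_carrier <- function(path, chunk_rows = 3L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # First line holds the column names; strip the quotes write.csv added.
  header <- gsub('"', "", strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]])

  sums   <- numeric(0)  # running sum of delays, named by carrier
  counts <- numeric(0)  # running count of non-missing delays, named by carrier

  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break

    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    chunk <- chunk[!is.na(chunk$ArrDelay), , drop = FALSE]

    # Update the running totals with this chunk's contribution.
    by_carrier <- split(chunk$ArrDelay, chunk$UniqueCarrier)
    for (k in names(by_carrier)) {
      prev_sum <- if (k %in% names(sums)) sums[[k]] else 0
      prev_n   <- if (k %in% names(counts)) counts[[k]] else 0
      sums[k]   <- prev_sum + sum(by_carrier[[k]])
      counts[k] <- prev_n + length(by_carrier[[k]])
    }
  }
  sums / counts
}

delays <- mean_delay_by_carrier(csv_path)
print(delays)
```

With the real files, chunk_rows would be set to something much larger, and the same loop could be run over each of the yearly files in turn while carrying the totals forward.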

A smaller, but still very useful, file containing Medicare data was used in an R-bloggers post highlighting bigglm and ffbase. This file, which is well suited to machine learning applications, contains almost 3 million rows and eleven variables.
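The appeal of a function like bigglm is that a model can be fitted from data delivered in chunks. A minimal base-R sketch of that idea for ordinary least squares follows. It accumulates the cross-products X'X and X'y chunk by chunk and is an illustration only, not the algorithm that the biglm package itself uses. The data are simulated, and chunked_lm() is a hypothetical helper.

```r
# Simulated data standing in for a file too large to model in one pass.
set.seed(42)
n   <- 1000
x   <- rnorm(n)
dat <- data.frame(x = x, y = 2 + 3 * x + rnorm(n, sd = 0.1))

# Hypothetical helper: least-squares coefficients computed from chunks.
# Only the small matrices X'X and X'y are kept between chunks.
chunked_lm <- function(data, formula, chunk_rows = 100L) {
  xtx <- NULL
  xty <- NULL
  for (start in seq(1L, nrow(data), by = chunk_rows)) {
    rows <- start:min(start + chunk_rows - 1L, nrow(data))
    mf   <- model.frame(formula, data[rows, , drop = FALSE])
    X    <- model.matrix(formula, mf)
    yy   <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)       # X'X for the first chunk
      xty <- crossprod(X, yy)   # X'y for the first chunk
    } else {
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, yy)
    }
  }
  drop(solve(xtx, xty))         # solve the normal equations
}

coef_chunked <- chunked_lm(dat, y ~ x)
coef_full    <- coef(lm(y ~ x, data = dat))
print(rbind(chunked = coef_chunked, full = coef_full))
```

The chunked and full-data coefficients should agree up to rounding error. This sketch assumes numeric predictors; with factors, every chunk would need the same set of levels for the model matrices to line up.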

Graham Williams and others (me included) have made good use of the small version of the Australian weather file in his rattle R package. However, in an appendix of his book Data Mining with Rattle and R, Graham points the way to the Australian government site, which makes the data available in what Hadley Wickham might call a “tidy” format. (The data are not “clean” but they are in good enough shape to work with.) The following chart was built with rattle from Canberra data collected between March and July of this year. Code to access and clean the file a bit, based on code Graham provides in his book, is available here: Download Code to clean weather data.

[Figure: Weather_corr_plot]

The moderately large airline “Edge” data set (3.5 million records), along with the data set of airports and their locations, both available without charge from infochimps, provided the occasion for a slightly more elaborate data shaping and cleaning effort using RevoScaleR functions. One way to do this is documented in the RevoScaleR Data Step White Paper.

As a final example of R-friendly data sets, have a look at those that Max Kuhn and Kjell Johnson have wrapped into the R package AppliedPredictiveModeling, which they wrote to support their Springer book of the same name. This package offers a number of interesting small data sets, including segmentationOriginal, which provides measurements on cell body features and has over 2,000 observations and more than 100 variables.
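Loading one of these packaged data sets takes only a few lines. The sketch below assumes the AppliedPredictiveModeling package has been installed from CRAN and returns NULL otherwise. load_segmentation() is a hypothetical convenience wrapper, not part of the package.

```r
# Hypothetical wrapper: returns the segmentationOriginal data frame,
# or NULL when the AppliedPredictiveModeling package is not installed.
load_segmentation <- function() {
  if (!requireNamespace("AppliedPredictiveModeling", quietly = TRUE)) {
    return(NULL)
  }
  env <- new.env()
  data("segmentationOriginal", package = "AppliedPredictiveModeling",
       envir = env)
  env$segmentationOriginal
}

seg <- load_segmentation()
if (is.null(seg)) {
  message("Install the package first: install.packages(\"AppliedPredictiveModeling\")")
} else {
  print(dim(seg))          # rows and columns
  print(table(seg$Case))   # the Case column marks training and test rows
}
```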

The data sets on the Inside-R list cover quite a bit of ground; however, I am sure that there is much more out there that should be on the list. We at Revolution Analytics would very much appreciate learning about what we have missed. Many thanks to everyone who has provided data sets or contributed to the Inside-R list in some other way.

by Joseph Rickert
