Where to Find Data to Use with R

DavidMSmith

(Contributing blogger Joe Rickert has put together a fantastic list of data sources suitable for use with R. If you’re looking for data to use in the Applications of R Contest — entries close October 31 — this is a great resource for you — Ed.)

Hardly a day goes by without someone or something reminding me that we are drowning in a sea of data (a bummer day ):, or that the new hero is the data scientist (a “Yes! Let’s go make some money” kind of day!!). This morning I read “…Google grew from processing 100 terabytes of data a day with MapReduce in 2004 to processing 20 petabytes a day with MapReduce in 2008.” (Lin and Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010, p. 1). Assuming linear growth, that would mean Google did about 400 terabytes during the 15 minutes it took me to check my email. Even if Google is getting more than its fair share, data should be everywhere: more data than I could ever need, more than I could process, more than I could ever imagine.
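
As a rough check on that extrapolation, here is a minimal sketch in Python (the same arithmetic works in any language, R included). It assumes 1 petabyte = 1,000 terabytes and a straight line through the 2004 and 2008 figures. The function name and the `years_after_2008` parameter are hypothetical, introduced only for illustration; none of these choices come from Lin and Dyer.

```python
def tb_per_window(years_after_2008, window_minutes=15.0):
    """Linear extrapolation of Google's daily MapReduce volume.

    Assumptions (not from the source): 1 PB = 1,000 TB, and growth
    follows a straight line through the 2004 and 2008 data points.
    Returns terabytes processed in a window of `window_minutes`.
    """
    tb_per_day_2004 = 100.0            # 100 TB/day in 2004
    tb_per_day_2008 = 20.0 * 1000.0    # 20 PB/day in 2008, as TB

    # Slope of the line, in TB/day gained per year.
    slope = (tb_per_day_2008 - tb_per_day_2004) / (2008 - 2004)

    # Daily volume at the requested point in time.
    tb_per_day = tb_per_day_2008 + slope * years_after_2008

    # Scale from a full day (1,440 minutes) down to the window.
    return tb_per_day * window_minutes / (24 * 60)


if __name__ == "__main__":
    for years in (0.0, 3.75):
        print(years, round(tb_per_window(years), 1))
```

Under these assumptions the 2008 rate works out to roughly 208 terabytes per 15 minutes, and an estimate made a little under four years later lands close to 400.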

So, how come every time I go to write a blog post or try some new stats I can never find any data? A few hours ago I Googled “free data sets” and got over 74,000,000 hits, but it looks as if it’s going to be another evening of me with iris. What’s wrong here? At the root, it’s a deep problem that gets at the essence of data. What are data anyway? My answer: data are structured information. Part of the structure includes meta-information describing the intention and the integrity with which the data were collected. When looking for a data set, even for some purpose that is not that important, we all want some evidence that the data were either collected with intentions similar to our own or that the data can be re-purposed. Moreover, we need to establish some comfort level that the data were not collected to deceive, and that they are reasonably representative, reasonably randomized, reasonably unbiased, etc. The more importance we place on our project, the more we tighten up on these requirements. This is not all philosophy. I think that focusing on intentions and integrity provides some practical guidance on where to search for data on the internet.

Historic financial data are relatively easy to find because the intentions with which they were collected are clear, and Yahoo, Google, FRED (from the St. Louis Fed), Oanda for currency data and others have made it their business to collect and maintain these data. Visit quantmod for R code to read in data from these sites and even more places to find financial data. The next places high on the data intentions and integrity scale are the world’s government agencies: federal, regional and municipal. Data.gov, the official website of the United States Government, the Census Bureau, the Department of Energy, the FBI, and other agencies have interesting data sets to offer. The National Institutes of Health even offers some data sets in R format. Here is a microarray data set.

Don’t just confine your search to the US. The UK is on a mission to open up the government. And, don’t just look at the federal level. Look here for a fairly clean data set on London (UK) municipal waste management and here for the dirt on a thousand or so NYC taxi complaints. The Guardian is trying to make it easy to surf the data from the world’s governments.

The sweet spots for data sets on varied topics having thousands to a few million records are the professional data set aggregators such as infochimps, datamarket and datamob.org. Some of these sites offer data sets for sale as well as offering some free data. They all seem to do a good job of describing the data and get high intention and integrity scores. KDnuggets tracks data sets that are large enough to use for data mining projects.

For the very ambitious: check out the free data sets that are available for analysis in Amazon’s cloud.

Finally, there are several bloggers and wiki makers out there trying to build, annotate and maintain their own lists. Some of my favorites are at blogspot and quora, the Januarist and Revolution. Stackexchange tracks data sources that have R interfaces. And, of course, there are those willing to make interesting data available for a song.

I’m maintaining a list of public data sources at inside-R.org. Please let me know what I have missed.
