SmartData Collective

Where to Find Data to Use with R

DavidMSmith

(Contributing blogger Joe Rickert has put together a fantastic list of data sources suitable for use with R. If you’re looking for data to use in the Applications of R Contest — entries close October 31 — this is a great resource for you — Ed.)

Hardly a day goes by without someone or something reminding me that we are drowning in a sea of data (a bummer day :( ), or that the new hero is the data scientist (a "Yes! Let’s go make some money" kind of day!). This morning I read: “…Google grew from processing 100 terabytes of data a day with MapReduce in 2004 to processing 20 petabytes a day with MapReduce in 2008.” (Lin and Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010, p. 1). Assuming linear growth, that would mean Google processed about 400 terabytes during the 15 minutes it took me to check my email. Even if Google is getting more than its fair share, data should be everywhere: more data than I could ever need, more than I could process, more than I could ever imagine.
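
As a rough check on that figure, here is a minimal Python sketch of the linear extrapolation. The two data points come from the quote above. The year being extrapolated to (2011) and the decimal conversion of 1 petabyte = 1,000 terabytes are assumptions, not something the quote states, and the function names are purely illustrative.

```python
def linear_daily_volume_tb(year, start_year=2004, start_tb=100.0,
                           end_year=2008, end_tb=20_000.0):
    """Terabytes per day on a straight line through the two quoted points.

    Assumption: 20 petabytes is taken as 20,000 terabytes (decimal units).
    """
    slope = (end_tb - start_tb) / (end_year - start_year)  # TB/day gained per year
    return start_tb + slope * (year - start_year)


def volume_in_minutes_tb(year, minutes=15):
    """Terabytes processed in `minutes` at the extrapolated daily rate."""
    minutes_per_day = 24 * 60
    return linear_daily_volume_tb(year) * minutes / minutes_per_day


# Assumption: the email check happens three years after the last quoted point.
ASSUMED_YEAR = 2011

print(round(volume_in_minutes_tb(ASSUMED_YEAR)))  # prints 364
```

Under these assumptions the line gives roughly 208 terabytes per 15 minutes in 2008 and passes 400 between the 2011 and 2012 points, so "about 400" is a ballpark rather than a precise figure.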

So, how come every time I go to write a blog post or try some new stats I can never find any data? A few hours ago I Googled “free data sets” and got over 74,000,000 hits, but it looks as if it’s going to be another evening of me with iris. What’s wrong here? At the root, it’s a deep problem that gets at the essence of data. What are data anyway? My answer: data are structured information. Part of the structure includes meta-information describing the intention and the integrity with which the data were collected. When looking for a data set, even for some purpose that is not that important, we all want some evidence that the data were either collected with intentions similar to our own intentions for using them, or that the data can be re-purposed. Moreover, we need to establish some comfort level that the data were not collected to deceive, and that they are reasonably representative, reasonably randomized, reasonably unbiased, etc. The more importance we place on our project, the more we tighten up on these requirements. This is not all philosophy. I think that focusing on intentions and integrity provides some practical guidance on where to search for data on the internet.

Historic financial data are relatively easy to find because the intentions with which they were collected are clear, and Yahoo, Google, FRED, the St. Louis Fed, Oanda (for currency data) and others have made it their business to collect and maintain these data. Visit quantmod for R code to read in data from these sites and for even more places to find financial data. The next places high on the data intentions and integrity scale are the world’s government agencies: federal, regional and municipal. Data.gov, the official website of the United States Government, the Census Bureau, the Department of Energy, the FBI, and other agencies have interesting data sets to offer. The National Institutes of Health even offers some data sets in R format. Here is a microarray data set.

Don’t just confine your search to the US. The UK is on a mission to open up the government. And don’t just look at the federal level. Look here for a fairly clean data set on London (UK) municipal waste management and here for the dirt on a thousand or so NYC taxi complaints. The Guardian is trying to make it easy to surf the data from the world’s governments.

The sweet spots for data sets on varied topics having thousands to a few million records are the professional data set aggregators such as infochimps, datamarket and datamob.org. Some of these sites offer data sets for sale as well as offering some free data. They all seem to do a good job of describing the data and get high intention and integrity scores. KDnuggets tracks data sets that are large enough to use for data mining projects.

For the very ambitious: check out the free data sets that are available for analysis in Amazon’s cloud.

Finally, there are several bloggers and wiki makers out there trying to build, annotate and maintain their own lists. Some of my favorites are at blogspot and quora, the Januarist and Revolution. Stackexchange tracks data sources that have R interfaces. And, of course, there are those willing to make interesting data available for a song.

I’m maintaining a list of public data sources at inside-R.org. Please let me know what I have missed.
