SmartData Collective
Where to Find Data to Use with R

DavidMSmith

(Contributing blogger Joe Rickert has put together a fantastic list of data sources suitable for use with R. If you’re looking for data to use in the Applications of R Contest — entries close October 31 — this is a great resource for you — Ed.)

Hardly a day goes by without someone or something reminding me that we are drowning in a sea of data (a bummer day), or that the new hero is the data scientist (a “Yes! Let’s go make some money” kind of day!). This morning I read “…Google grew from processing 100 terabytes of data a day with MapReduce in 2004 to processing 20 petabytes a day with MapReduce in 2008.” (Lin and Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010, p. 1.) Assuming linear growth, that would mean Google did about 400 terabytes during the 15 minutes it took me to check my email. Even if Google is getting more than its fair share, data should be everywhere: more data than I could ever need, more than I could process, more than I could ever imagine.
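
As a rough check on that estimate, the sketch below extrapolates the two quoted figures along a straight line and scales the result to a 15-minute window. It assumes decimal units (1 PB = 1,000 TB). The helper `tb_per_15_min` is hypothetical, and the year 2011 is only an assumption about when this post was written, since the quote itself covers just 2004 and 2008.

```r
# Daily volumes quoted from Lin and Dyer, in terabytes (assuming 1 PB = 1000 TB).
tb_2004 <- 100
tb_2008 <- 20 * 1000

# Straight-line growth between the two quoted points, in TB/day per year.
slope <- (tb_2008 - tb_2004) / (2008 - 2004)

# Hypothetical helper: extrapolated volume processed in 15 minutes of a given year.
tb_per_15_min <- function(year) {
  daily_tb <- tb_2004 + slope * (year - 2004)
  daily_tb * 15 / (24 * 60)
}

tb_per_15_min(2008)  # the last year covered by the quote
tb_per_15_min(2011)  # assumed year of writing (not stated in the post)
```

On these assumptions the 2008 rate works out to roughly 208 TB per 15 minutes and the 2011 extrapolation to roughly 364 TB, which is the same order of magnitude as the figure quoted above.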

So, how come every time I go to write a blog post or try some new stats I can never find any data? A few hours ago I Googled “free data sets” and got over 74,000,000 hits, but it looks as if it’s going to be another evening of me with iris. What’s wrong here? At the root, it’s a deep problem that gets at the essence of data. What are data anyway? My answer: data are structured information. Part of the structure includes meta-information describing the intention and the integrity with which the data were collected. When looking for a data set, even for some purpose that is not that important, we all want some evidence that the data were either collected with intentions similar to our own or that the data can be repurposed. Moreover, we need to establish some comfort level that the data were not collected to deceive, and that they are reasonably representative, reasonably randomized, reasonably unbiased, etc. The more importance we place on our project, the more we tighten up on these requirements. This is not all philosophy. I think that focusing on intentions and integrity provides some practical guidance on where to search for data on the internet.

Historic financial data are relatively easy to find because the intentions with which they were collected are clear, and Yahoo, Google, FRED, the St. Louis Fed, Oanda (for currency data) and others have made it their business to collect and maintain these data. Visit quantmod for R code to read in data from these sites and even more places to find financial data. The next places high on the data intentions and integrity scale are the world’s government agencies: federal, regional and municipal. Data.gov, the official website of the United States Government, the Census Bureau, the Department of Energy, the FBI, and other agencies have interesting data sets to offer. The National Institutes of Health even offers some data sets in R format. Here is a microarray data set.

Don’t just confine your search to the US. The UK is on a mission to open up the government. And don’t just look at the federal level. Look here for a fairly clean data set on London (UK) municipal waste management and here for the dirt on a thousand or so NYC taxi complaints. The Guardian is trying to make it easy to surf the data from the world’s governments.

The sweet spots for data sets on varied topics having thousands to a few million records are the professional data set aggregators such as infochimps, datamarket and datamob.org. Some of these sites offer data sets for sale as well as offering some free data. They all seem to do a good job of describing the data and get high intention and integrity scores. KDnuggets tracks data sets that are large enough to use for data mining projects.

For the very ambitious: check out the free data sets that are available for analysis in Amazon’s cloud.

Finally, there are several bloggers and wiki makers out there trying to build, annotate and maintain their own lists. Some of my favorites are at blogspot and quora, the Januarist and Revolution. Stackexchange tracks data sources that have R interfaces. And, of course, there are those willing to make interesting data available for a song.

I’m maintaining a list of public data sources at inside-R.org. Please let me know what I have missed.
