Scraping data from the Web with R
Sometimes the data we need isn’t packaged up nicely into a simple comma-separated file or database. It’s out there, but only in unstructured (or semi-structured) form: displayed as a table on a Web page, for example. With the RCurl package, some regular expressions, and a little knowledge of HTML, it’s possible to extract (or scrape) the structured data you need. Programming R gives a simple example of scraping the r-help listserv archives to tabulate the most prolific posters. (Incidentally, the same techniques are used in the O’Reilly Short Cut Data Mashups with R, in the context of a much more detailed example.)
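As a rough illustration of the approach described above, here is a minimal sketch in base R. It does not reproduce the Programming R code: the HTML snippet, the author names, and the assumption that each author appears inside an `<i>...</i>` tag are all hypothetical stand-ins for a real archive page. For a live page, the text would typically come from `readLines(url)` or `RCurl::getURL(url)` rather than an inline vector.

```r
# Hypothetical stand-in for a downloaded archive index page.
# (Assumed layout: one post per line, author name inside <i>...</i>.)
page <- c(
  '<ul>',
  '<li><a href="0001.html">[R] reading a file</a> <i>Alice Example</i></li>',
  '<li><a href="0002.html">[R] plotting question</a> <i>Bob Sample</i></li>',
  '<li><a href="0003.html">[R] reading a file</a> <i>Alice Example</i></li>',
  '</ul>'
)

# Keep only the lines that contain an author element.
author_lines <- grep("<i>[^<]+</i>", page, value = TRUE)

# Pull out the text between <i> and </i> with a capture group.
authors <- sub(".*<i>([^<]+)</i>.*", "\\1", author_lines)

# Count posts per author and put the most prolific posters first.
post_counts <- sort(table(authors), decreasing = TRUE)
print(post_counts)
```

Regular expressions like these are tied to the exact markup of the page, so any change in the page layout would require the patterns to be adjusted.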
Web scraping should always be a last resort: you’re at the mercy of the site owner tweaking the format and breaking your code, and many sites frown on the practice even if your use of the data is legitimate. But it’s often a useful way of getting at public data sources as a one-off activity, or where the format has been static for a long time.
Programming R: Webscraping using readLines and RCurl