How can I use R (RCurl/XML packages?) to scrape this webpage?

Posted by Tal Galili on Stack Overflow
Published on 2010-03-14T18:03:23Z Indexed on 2010/03/14 18:05 UTC

Filed under:

dna

Hi all,

I have a (somewhat complex) web-scraping challenge that I wish to accomplish, and I would love some direction (to whatever level you feel like sharing). Here goes:

I would like to go through all the "species pages" present in this link:

http://gtrnadb.ucsc.edu/

So for each of them I will go to:

The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
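
To make the two steps above concrete, here is a minimal sketch of how the two URLs could be built for one species. It assumes every species follows the same pattern as the single Aero_pern example, which I have not verified; `species_urls` is a hypothetical helper name, not something from RCurl or XML.

```r
# Assumption: all species pages follow the pattern seen in the Aero_pern example.
base_url <- "http://gtrnadb.ucsc.edu/"

# Hypothetical helper: build the species page URL and the
# "Secondary Structures" page URL from a species directory name.
species_urls <- function(species, base = base_url) {
  list(
    species_page = paste0(base, species, "/"),
    structs_page = paste0(base, species, "/", species, "-structs.html")
  )
}

urls <- species_urls("Aero_pern")
print(urls$species_page)
print(urls$structs_page)
```

The species names themselves would still have to be collected from the links on the front page.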

On that page I wish to scrape the data so that I end up with a long list containing records like this one (for example):

chr.trna3 (1-77)    Length: 77 bp
Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Each line should have its own list, nested inside the list for each "trna", which is in turn nested inside the list for each animal.
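
To illustrate the nested structure I have in mind, here is a rough sketch in base R that turns the four example lines above into one list and nests it by animal and tRNA name. The function name `parse_trna` and the field names are my own placeholders, and the regular expressions assume every record is laid out exactly like this example.

```r
# The four lines of the example record, copied from the page text.
record <- c(
  "chr.trna3 (1-77)    Length: 77 bp",
  "Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45",
  "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
  "Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
)

# Hypothetical parser: assumes exactly four lines in the order shown above.
parse_trna <- function(lines) {
  header <- lines[1]
  info   <- lines[2]
  list(
    name      = sub(" .*$", "", header),
    length_bp = as.integer(sub("^.*Length: *([0-9]+) *bp.*$", "\\1", header)),
    type      = sub("^Type: *([A-Za-z]+).*$", "\\1", info),
    anticodon = sub("^.*Anticodon: *([A-Za-z]+).*$", "\\1", info),
    score     = as.numeric(sub("^.*Score: *([0-9.]+).*$", "\\1", info)),
    seq       = sub("^Seq: *", "", lines[3]),
    str       = sub("^Str: *", "", lines[4])
  )
}

# Nest the record: animal -> tRNA name -> fields.
rec <- parse_trna(record)
trnas <- list()
trnas[[rec$name]] <- rec
all_species <- list(Aero_pern = trnas)

str(all_species)
```

With this layout, a single field would be reachable as, for example, `all_species$Aero_pern$chr.trna3$anticodon`.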

I remember coming across the packages RCurl and XML (in R), which can allow for such a task, but I don't know how to use them. So what I would love to have is: 1. Some suggestions on how to build such code. 2. Recommendations for how to learn what is needed to perform such a task.

Thanks for any help,

Tal
