How to isolate a single element from a scraped web page in R

Posted by PaulHurleyuk on Stack Overflow
Published on 2010-06-08T15:14:21Z Indexed on 2010/06/08 17:32 UTC

Filed under: Xml | r | webscraping | rcurl

Hello,

I'm trying to do someone a favour, and it's a tad outside my comfort zone, so I'm stuck.

I want to use R to scrape this page (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html) and others like it, to get the goal scorers and the times of their goals.

So far, this is what I've got:

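library(RCurl)
library(XML)

# Match report page to scrape
theURL <- "http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
# Download the raw HTML as a single string
webpage <- getURL(theURL, header = FALSE, verbose = TRUE)
# Split the string into lines via a text connection
webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse into an internal document, silencing parser error messages
pagetree <- htmlTreeParse(webpagecont, error = function(...) {}, useInternalNodes = TRUE)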

The pagetree object now contains a pointer to my parsed HTML (I think). The part I want is:

<div class="cont"><ul>
<div class="bold medium">Goals scored</div>
        <li>Philipp LAHM (GER) 6', </li>
        <li>Paulo WANCHOPE (CRC) 12', </li>
        <li>Miroslav KLOSE (GER) 17', </li>
        <li>Miroslav KLOSE (GER) 61', </li>
        <li>Paulo WANCHOPE (CRC) 73', </li>
        <li>Torsten FRINGS (GER) 87'</li>
</ul></div>

But I'm now lost as to how to isolate them, and frankly xpathSApply and xpathApply confuse the bejeebies out of me!

So, does anyone know how to formulate a command to extract the elements contained within these tags?

Thanks

Paul.
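
One possible approach, offered as a minimal sketch rather than a definitive answer: an XPath expression passed to xpathSApply can select every <li> beneath the <div class="cont"> element, and xmlValue returns the text of each one. To keep the example runnable offline, the fragment quoted above is inlined as a string instead of being fetched from the live page. The names html, doc, goals, parts and scorers are illustrative, and the regular expression assumes every entry follows the "Name (TEAM) minute'" pattern seen in the fragment.

```r
library(XML)

# The fragment quoted in the question, inlined so the sketch runs offline.
html <- paste(
  '<html><body><div class="cont"><ul>',
  '<div class="bold medium">Goals scored</div>',
  "<li>Philipp LAHM (GER) 6', </li>",
  "<li>Paulo WANCHOPE (CRC) 12', </li>",
  "<li>Miroslav KLOSE (GER) 17', </li>",
  "<li>Miroslav KLOSE (GER) 61', </li>",
  "<li>Paulo WANCHOPE (CRC) 73', </li>",
  "<li>Torsten FRINGS (GER) 87'</li>",
  '</ul></div></body></html>',
  sep = "\n"
)

# Parse the string into an internal document (same kind of object as pagetree).
doc <- htmlParse(html, asText = TRUE)

# Select every <li> anywhere below <div class="cont"> and return its text.
goals <- xpathSApply(doc, "//div[@class='cont']//li", xmlValue)

# Split each entry into player, team and minute.
# Assumption: entries always look like "Name (TEAM) minute'".
parts <- regmatches(goals, regexec("^\\s*([^(]+) \\(([A-Z]+)\\) ([0-9]+)'", goals))
scorers <- data.frame(
  player = sapply(parts, `[`, 2),
  team   = sapply(parts, `[`, 3),
  minute = as.integer(sapply(parts, `[`, 4)),
  stringsAsFactors = FALSE
)
print(scorers)
```

Against the live page, the same XPath could presumably be run on the pagetree object built earlier, i.e. xpathSApply(pagetree, "//div[@class='cont']//li", xmlValue). That assumes "cont" is the exact class attribute and that no other div on the page shares it; if several do, the expression would need narrowing, for example by also matching on the "Goals scored" label.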
