Search Results

Search found 6 results on 1 page for 'rcurl'.

rcurl web scraping timeout exits program

- by user1742368

I am using a loop and RCurl to scrape data from multiple pages. This works fine at certain times but fails when there is a timeout because the server does not respond. I am using timeout=30, which traps the timeout error; however, the program stops after the timeout. I would like the program to continue to the next page when the timeout occurs, but I can't figure out how to do this.

    curl = getCurlHandle(cookiefile = "", verbose = TRUE)

Here is the statement I am using that causes the timeout. I am happy to share the code if there is interest.

    webpage = getURLContent(url, followlocation = TRUE, curl = curl,
                            .opts = list(verbose = TRUE, timeout = 90, maxredirs = 2))

woodwardjj
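
One common way to keep a loop running after a failed request is to wrap each call in tryCatch(). The following is a minimal sketch of that pattern, not the asker's code: fetch_all() and fake_fetch() are hypothetical names, and fake_fetch() only simulates a timeout so the example runs without a network connection.

```r
# Hypothetical helper: apply `fetch` to every URL and turn any error into NA,
# so that one failed page does not abort the whole loop.
fetch_all <- function(urls, fetch) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- tryCatch(
      fetch(urls[[i]]),
      error = function(e) {
        message("Skipping ", urls[[i]], ": ", conditionMessage(e))
        NA_character_
      }
    )
  }
  results
}

# Stand-in for a real download; it fails for one URL to simulate a timeout.
fake_fetch <- function(u) {
  if (grepl("slow", u)) stop("Timeout was reached")
  paste("content of", u)
}

urls <- c("http://example.com/a", "http://example.com/slow", "http://example.com/b")
pages <- fetch_all(urls, fake_fetch)
```

With RCurl, the fetch argument could be something like function(u) getURLContent(u, curl = curl, .opts = list(timeout = 90)), assuming the timeout surfaces as an ordinary R error condition, which the error handler would then catch.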

R: extracting "clean" UTF-8 text from a web page scraped with RCurl

- by SlowLearner

Using R, I am trying to scrape a web page and save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis. I already have a workable solution in Perl, but I am trying to migrate the script to R to reduce the cognitive load of switching between multiple languages. So far I am not succeeding.

Related questions seem to be this one on saving csv files and this one on writing Hebrew to a HTML file. However, I haven't been successful in cobbling together a solution based on the answers there.

The pages are from Yahoo! Japan Finance, and my Perl code looks like this.

    use strict;
    use HTML::Tree;
    use LWP::Simple;
    #use Encode;
    use utf8;

    binmode STDOUT, ":utf8";

    my @arr_links = ();
    $arr_links[1] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203";
    $arr_links[2] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201";

    foreach my $link (@arr_links){
        $link =~ s/"//gi;
        print("$link\n");
        my $content = get($link);
        my $tree = HTML::Tree->new();
        $tree->parse($content);
        my $bar = $tree->as_text;
        open OUTFILE, ">>:utf8", join("","c:/", substr($link, -4),"_perl.txt") || die;
        print OUTFILE $bar;
    }

This Perl script produces a CSV file that looks like the screenshot below, with proper kanji and kana that can be mined and manipulated offline:

My R code, such as it is, looks like the following. The R script is not an exact duplicate of the Perl solution just given, as it doesn't strip out the HTML and leave the text (this answer suggests an approach using R but it doesn't work for me in this case) and it doesn't have the loop and so on, but the intent is the same.

    require(RCurl)
    require(XML)

    links <- list()
    links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
    links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

    txt <- getURL(links, .encoding = "UTF-8")
    Encoding(txt) <- "bytes"
    write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE,
                sep = "\t", fileEncoding = "UTF-8")

This R script generates the output shown in the screenshot below. Basically rubbish.

I assume that there is some combination of HTML, text and file encoding that will allow me to generate in R a similar result to that of the Perl solution, but I cannot find it. The header of the HTML page I'm trying to scrape says the charset is utf-8, and I have set the encoding in the getURL call and in the write.table function to utf-8, but this alone isn't enough.

The question

How can I scrape the above web page using R and save the text as CSV in "well-formed" Japanese text rather than something that looks like line noise?

Edit: I have added a further screenshot to show what happens when I omit the Encoding step. I get what look like Unicode codes, but not the graphical representation of the characters. So it may be some kind of locale-related issue, but in the exact same locale the Perl script does provide useful output. So this is still puzzling.
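
A possible direction, offered only as a sketch: marking a string as "bytes" makes R treat it as raw bytes rather than text, so one thing to try is keeping the text marked as UTF-8 and writing it through a connection with useBytes = TRUE, so R does not re-encode it into the native locale. The example below uses a hard-coded Japanese string as a stand-in for the scraped page and a temporary file as a hypothetical output path; whether this resolves the problem on the asker's Windows setup is an assumption.

```r
# Stand-in for scraped page text, written with \u escapes so the example
# does not depend on the encoding of this script file.
txt <- "\u30c8\u30e8\u30bf\u81ea\u52d5\u8eca"

out <- tempfile(fileext = ".txt")  # hypothetical output path

# enc2utf8() converts to UTF-8; useBytes = TRUE writes those bytes unchanged
# instead of re-encoding them into the native locale.
con <- file(out, open = "w")
writeLines(enc2utf8(txt), con, useBytes = TRUE)
close(con)

# Read the file back, declaring its bytes to be UTF-8.
back <- readLines(out, encoding = "UTF-8")
```

For real pages, txt would be the result of getURL(), with Encoding(txt) <- "UTF-8" in place of "bytes".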

How to isolate a single element from a scraped web page in R

- by PaulHurleyuk

Hello, I'm trying to do someone a favour, and it's a tad outside my comfort zone, so I'm stuck. I want to use R to scrape this page (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html) and others, to get the goal scorers and times. So far, this is what I've got:

    require(RCurl)
    require(XML)

    theURL <- "http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
    webpage <- getURL(theURL, header=FALSE, verbose=TRUE)
    webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)
    pagetree <- htmlTreeParse(webpagecont, error=function(...){}, useInternalNodes = TRUE)

and the pagetree object now contains a pointer to my parsed html (I think). The part I want is

    <div class="cont"><ul>
    <div class="bold medium">Goals scored</div>
    <li>Philipp LAHM (GER) 6', </li>
    <li>Paulo WANCHOPE (CRC) 12', </li>
    <li>Miroslav KLOSE (GER) 17', </li>
    <li>Miroslav KLOSE (GER) 61', </li>
    <li>Paulo WANCHOPE (CRC) 73', </li>
    <li>Torsten FRINGS (GER) 87'</li>
    </ul></div>

but I'm now lost as to how to isolate them, and frankly xpathSApply and xpathApply confuse the beejeebies out of me! So, does anyone know how to formulate a command to suck out the elements contained within the tags? Thanks, Paul.
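
As an illustration of how an XPath query could pick out those list items, here is a small sketch using the XML package on a shortened copy of the HTML fragment quoted in the question. It assumes the XML package is installed and that the live page has the same structure as the fragment; the trailing-comma cleanup is an added convenience, not something the question specifies.

```r
library(XML)

# Shortened copy of the fragment from the question, parsed from a string
# rather than downloaded.
html <- paste0(
  '<div class="cont"><ul>',
  '<div class="bold medium">Goals scored</div>',
  "<li>Philipp LAHM (GER) 6', </li>",
  "<li>Paulo WANCHOPE (CRC) 12', </li>",
  "<li>Torsten FRINGS (GER) 87'</li>",
  "</ul></div>"
)
doc <- htmlParse(html, asText = TRUE)

# Select every <li> anywhere below the <div class="cont"> element
# and return its text content.
goals <- xpathSApply(doc, "//div[@class='cont']//li", xmlValue)

# Strip the trailing comma and whitespace left over from the page layout.
goals <- sub(",?\\s*$", "", goals)
```

The same query should work on the pagetree object from the question, since it was parsed with useInternalNodes = TRUE.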

How can I use R (Rcurl/XML packages ?!) to scrap this webpage ?

- by Tal Galili

Hi all, I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:

I would like to go through all the "species pages" present in this link: http://gtrnadb.ucsc.edu/

So for each of them I will go to:

- The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
- And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)

Inside that link I wish to scrape the data in the page so that I will have a long list containing this data (for example):

    chr.trna3 (1-77) Length: 77 bp
    Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
    Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
    Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Where each line will have its own list (inside the list for each "trna", inside the list for each animal).

I remember coming across the packages RCurl and XML (in R) that can allow for such a task. But I don't know how to use them. So what I would love to have is:

1. Some suggestions on how to build such code.
2. A recommendation for how to learn the knowledge needed for performing such a task.

Thanks for any help, Tal
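
To make the target data structure concrete, the sketch below turns one record like the example above into a named R list using base regular expressions. The function name parse_trna() is hypothetical, and the four-line layout of the record is an assumption about how the page presents it; downloading the pages is left out.

```r
# One record, split into four lines (assumed layout).
record <- c(
  "chr.trna3 (1-77) Length: 77 bp",
  "Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45",
  "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
  "Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
)

# Hypothetical parser: extract each field with sub() and a capture group.
parse_trna <- function(lines) {
  list(
    name      = sub(" .*$", "", lines[1]),
    length    = as.integer(sub(".*Length: ([0-9]+) bp.*", "\\1", lines[1])),
    type      = sub("^Type: ([A-Za-z]+).*", "\\1", lines[2]),
    anticodon = sub(".*Anticodon: ([A-Za-z]+).*", "\\1", lines[2]),
    score     = as.numeric(sub(".*Score: ([0-9.]+).*", "\\1", lines[2])),
    seq       = sub("^Seq: ", "", lines[3]),
    str       = sub("^Str: ", "", lines[4])
  )
}

trna <- parse_trna(record)
```

Applying such a function to every record on a page, and then to every species page, would give the nested list described above.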

getURL, parsing web-site with german special characters

- by Kay

I am using getURL() and htmlParse() - how can I make web-site content with special characters display properly?

    library(RCurl); library(XML)
    script <- getURL("http://www.floraweb.de/pflanzenarten/foto.xsql?suchnr=814")
    doc <- htmlParse(script, encoding = "UTF-8")
    xpathSApply(doc, "//div[@id='content']//p", xmlValue)[2]
    [1] "Bellis perennis L., GÃ¤nseblÃ¼mchen"
    # should say:
    [1] "Bellis perennis L., Gänseblümchen"

    > Sys.getlocale()
    [1] "LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252"
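
The garbled output has the typical shape of UTF-8 bytes that were interpreted as a single-byte encoding such as Latin-1 or Windows-1252. The base R sketch below reproduces that effect on a short string and then reverses it. It illustrates the mechanism under that assumption and is not a confirmed diagnosis of the asker's setup.

```r
# The intended text, written with \u escapes (stored as UTF-8).
correct <- "G\u00e4nsebl\u00fcmchen"

# Simulate the failure: keep the UTF-8 bytes but declare them to be latin1,
# then convert to UTF-8. Each two-byte character becomes two characters.
garbled <- correct
Encoding(garbled) <- "latin1"
garbled <- enc2utf8(garbled)

# Reverse it: convert back to latin1 to recover the original bytes,
# then declare those bytes to be UTF-8 again.
fixed <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(fixed) <- "UTF-8"
```

If this is indeed the cause, passing .encoding = "UTF-8" to getURL() may avoid the mis-decoding in the first place.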

Slow Python HTTP server on localhost

- by Abiel

I am experiencing some performance problems when creating a very simple Python HTTP server. The key issue is that performance varies depending on which client I use to access it, where the server and all clients are being run on the local machine. For instance, a GET request issued from a Python script (urllib2.urlopen('http://localhost/').read()) takes just over a second to complete, which seems slow considering that the server is under no load. Running the GET request from Excel using MSXML2.ServerXMLHTTP also feels slow. However, requesting the data from Google Chrome or from RCurl, the curl add-in for R, yields an essentially instantaneous response, which is what I would expect.

Adding further to my confusion is that I do not experience any performance problems for any client when I am on my computer at work (the performance problems are on my home computer). Both systems run Python 2.6, although the work computer runs Windows XP instead of 7.

Below is my very simple server example, which simply returns 'Hello world' for any GET request.

    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

    class MyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            print("Just received a GET request")
            self.send_response(200)
            self.send_header("Content-type", "text/html")
            self.end_headers()
            self.wfile.write('Hello world')
            return

        def log_request(self, code=None, size=None):
            print('Request')

        def log_message(self, format, *args):
            print('Message')

    if __name__ == "__main__":
        try:
            server = HTTPServer(('localhost', 80), MyHandler)
            print('Started http server')
            server.serve_forever()
        except KeyboardInterrupt:
            print('^C received, shutting down server')
            server.socket.close()

Note that in MyHandler I override the log_request() and log_message() functions. The reason is that I read that a fully-qualified domain name lookup performed by one of these functions might be a reason for a slow server. Unfortunately, setting them to just print a static message did not solve my problem.

Also, notice that I have put in a print() statement as the first line of the do_GET() routine in MyHandler. The slowness occurs prior to this message being printed, meaning that none of the stuff that comes after it is causing a delay.
