R: extracting "clean" UTF-8 text from a web page scraped with RCurl

Posted by SlowLearner on Stack Overflow See other posts from Stack Overflow or by SlowLearner
Published on 2012-06-17T08:19:26Z Indexed on 2012/06/18 3:16 UTC
Read the original article Hit count: 313

Filed under:

Using R, I am trying to scrape a web page save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis. I already have a workable solution in Perl, but I am trying to migrate the script to R to reduce the cognitive load of switching between multiple languages. So far I am not succeeding. Related questions seem to be this one on saving csv files and this one on writing Hebrew to a HTML file. However, I haven't been successful in cobbling together a solution based on the answers there.

The pages are from Yahoo! Japan Finance and my Perl code that looks like this.

use strict;
use HTML::Tree;
use LWP::Simple;
#use Encode;
use utf8;

binmode STDOUT, ":utf8";

my @arr_links = ();
$arr_links[1] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203";
$arr_links[2] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201";

foreach my $link (@arr_links){
    $link =~ s/"//gi;
    print("$link\n");
    my $content = get($link);
    my $tree = HTML::Tree->new();
    $tree->parse($content);
    my $bar = $tree->as_text;
    open OUTFILE, ">>:utf8", join("","c:/", substr($link, -4),"_perl.txt") || die;
    print OUTFILE $bar;
}

This Perl script produces a CSV file that looks like the screenshot below, with proper kanji and kana that can be mined and manipulated offline:

CSV file produced by Perl script

My R code, such as it is, looks like the following. The R script is not an exact duplicate of the Perl solution just given, as it doesn't strip out the HTML and leave the text (this answer suggests an approach using R but it doesn't work for me in this case) and it doesn't have the loop and so on, but the intent is the same.

require(RCurl)
require(XML)

links <- list()
links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

txt <- getURL(links, .encoding = "UTF-8")
Encoding(txt) <- "bytes"
write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")

This R script generates the output shown in the screenshot below. Basically rubbish.

CSV file produced by R script

I assume that there is some combination of HTML, text and file encoding that will allow me to generate in R a similar result to that of the Perl solution but I cannot find it. The header of the HTML page I'm trying to scrape says the chartset is utf-8 and I have set the encoding in the getURL call and in the write.table function to utf-8, but this alone isn't enough.

The question How can I scrape the above web page using R and save the text as CSV in "well-formed" Japanese text rather than something that looks like line noise?

Edit: I have added a further screenshot to show what happens when I omit the Encoding step. I get what look like Unicode codes, but not the graphical representation of the characters. So it may be some kind of locale-related issue, but in the exact same locale the Perl script does provide useful output. So this is still puzzling.

enter image description here

Developer IT

R: extracting "clean" UTF-8 text from a web page scraped with RCurl - Developer IT

R: extracting "clean" UTF-8 text from a web page scraped with RCurl

r

web-scraping

rcurl

Related posts about r

Related posts about web-scraping

Webscraping Google tasks via Google Calendar

Source for Names to use in web scraping

Web scraping with Python

Web scraping with Python

Python web scraping involving HTML tags with attributes

Categories cloud