Perl LWP::UserAgent mishandling UTF-8 response

Posted by RedGrittyBrick on Stack Overflow See other posts from Stack Overflow or by RedGrittyBrick
Published on 2010-12-31T19:44:56Z Indexed on 2011/01/01 11:54 UTC
Read the original article Hit count: 326

Filed under:

utf-8

When I use LWP::UserAgent to retrieve content encoded in UTF-8 it seems LWP::UserAgent doesn't handle the encoding correctly.

Here's the output after setting the Command Prompt window to Unicode by the command chcp 65001 Note that this initially gives the appearance that all is well, but I think it's just the shell reassembling bytes and decoding UTF-8, From the other output you can see that perl itself is not handling wide characters correctly.

C:\>perl getutf8.pl
======================================================================
HTTP/1.1 200 OK
Connection: close
Date: Fri, 31 Dec 2010 19:24:04 GMT
Accept-Ranges: bytes
Server: Apache/2.2.8 (Win32) PHP/5.2.6
Content-Length: 75
Content-Type: application/xml; charset=utf-8
Last-Modified: Fri, 31 Dec 2010 19:20:18 GMT
Client-Date: Fri, 31 Dec 2010 19:24:04 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1

<?xml version="1.0" encoding="UTF-8"?>
<name>Budejovický Budvar</name>

======================================================================
response content length is 33

....v....1....v....2....v....3....v....4
<name>Budejovický Budvar</name>

. . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . .
3c6e616d653e427564c49b6a6f7669636bc3bd204275647661723c2f6e616d653e
< n a m e > B u d ? ? j o v i c k ? ?   B u d v a r < / n a m e >

Above you can see the payload length is 31 characters but Perl thinks it is 33. For confirmation, in the hex, we can see that the UTF-8 sequences c49b and c3bd are being interpreted as four separate characters and not as two Unicode characters.

Here's the code

#!perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $response = $ua->get('http://localhost/Bud.xml');
if (! $response->is_success) { die $response->status_line; }

print '='x70,"\n",$response->as_string(), '='x70,"\n";

my $r = $response->decoded_content((charset => 'UTF-8')); 
$/ = "\x0d\x0a"; # seems to be \x0a otherwise!
chomp($r);

# Remove any xml prologue
$r =~ s/^<\?.*\?>\x0d\x0a//;

print "Response content length is ", length($r), "\n\n";
print "....v....1....v....2....v....3....v....4\n";
print $r,"\n";

print ". . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . . \n";
print unpack("H*", $r), "\n";
print join(" ", split("", $r)), "\n";

Note that Bud.xml is UTF-8 encoded without a BOM.

How can I persuade LWP::UserAgent to do the right thing?

P.S. Ultimately I want to translate the Unicode data into an ASCII encoding, even if it means replacing each non-ASCII character with one question mark or other marker.

I have accepted Ysth's "upgrade" answer - because I know it is the right thing to do when possible. However I am going to use a work-around (which may depress Tom further):

$r = encode("cp437", decode("utf8", $r));

Developer IT

Perl LWP::UserAgent mishandling UTF-8 response - Developer IT

Perl LWP::UserAgent mishandling UTF-8 response

perl

unicode

utf-8

Related posts about perl

Munin on Centos 6 - missing perl MODULE_COMPAT_5.8.8

Pain removing a perl rootkit

How To Avoid a Perl script calling an Another Perl Script

Perl :how to sort dates in perl

please suggest a perl book exclusively for perl programs

Related posts about unicode

Translating Between Unicode and Non-Unicode Character Sets in Java

SQLite, python, unicode, and non-utf data

SQLite, python, unicode, and non-utf data

notepad sql Unicode and Non Unicode

On Windows 7, dir or tree can't show unicode characters, even starting cmd with cmd /U

Categories cloud