ruby 1.9: invalid byte sequence in UTF-8

Posted by Marc Seeger on Stack Overflow
Published on 2010-06-06T00:35:34Z Indexed on 2010/06/06 0:42 UTC

Filed under: ruby, encoding

I'm writing a crawler in Ruby (1.9) that consumes a lot of HTML from many random sites.
To extract links, I decided to just use .scan(/href="(.*?)"/i) instead of Nokogiri/Hpricot, which gave a major speedup. The problem is that I now get a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, so the data that comes in is basically not tagged with its real encoding.
What would be the best way to work with that incoming data? I tried .encode with the replace and invalid options set, but without success so far.
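
For illustration, here is a minimal sketch of one way to make such input safe for .scan. It is not part of the original post: `sanitize_utf8` and `extract_links` are hypothetical helper names, and the sample HTML is made up. As far as I know, Ruby 1.9 skips transcoding when the source and target encodings are both UTF-8, which may be why .encode with the replace and invalid options appeared to do nothing. The sketch uses String#scrub where it exists (Ruby 2.1 and later) and otherwise falls back to a round trip through UTF-16 so that the transcoder actually runs.

```ruby
# Hypothetical helper (not from the original post): returns a copy of +raw+
# that is valid UTF-8, with invalid byte sequences replaced by +replacement+.
def sanitize_utf8(raw, replacement = "")
  # Work on a copy and treat the bytes as UTF-8 without changing them.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  if text.respond_to?(:scrub)
    # Ruby 2.1+: String#scrub replaces each invalid byte sequence.
    text.scrub(replacement)
  else
    # Older Rubies: transcode to UTF-16 and back so invalid bytes get replaced.
    text.encode(Encoding::UTF_16BE, invalid: :replace, undef: :replace,
                                    replace: replacement)
        .encode(Encoding::UTF_8)
  end
end

# Hypothetical helper: extracts href values using the regex from the question.
def extract_links(html)
  sanitize_utf8(html).scan(/href="(.*?)"/i).flatten
end

# Made-up sample: "\xE9" is a Latin-1 byte, which is not valid UTF-8 here.
html = "<a href=\"http://example.com/\">caf\xE9</a>".dup.force_encoding(Encoding::UTF_8)
links = extract_links(html)
puts links.inspect
```

Note that this drops (or replaces) the offending bytes rather than recovering the original characters. If the real charset of a page is known, for example from the Content-Type header or a meta tag, transcoding from that charset (for instance `body.encode("UTF-8", "ISO-8859-1")`) would preserve the text instead.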

© Stack Overflow or respective owner

