Ruby 1.9: invalid byte sequence in UTF-8

Posted by Marc Seeger on Stack Overflow
Published on 2010-06-06T00:35:34Z Indexed on 2010/06/06 0:42 UTC

I'm writing a crawler in Ruby (1.9) that consumes a lot of HTML from many random sites.
To extract links, I decided to use .scan(/href="(.*?)"/i) instead of Nokogiri/Hpricot, which gave a major speedup. The problem is that I now get a lot of "invalid byte sequence in UTF-8" errors.
As far as I understand, the net/http library has no encoding-specific options, so the incoming data is basically not tagged with the correct encoding.
What would be the best way to work with that incoming data? I tried .encode with the :replace and :invalid options set, but with no success so far.
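
For illustration, here is a minimal sketch of one commonly suggested workaround. It is not a definitive fix. It assumes the fetched body is mostly UTF-8 (or at least ASCII-compatible) and that losing the occasional malformed byte is acceptable. The helper name extract_hrefs and the "?" replacement character are hypothetical choices, not part of the original code. The idea is to tag the bytes as UTF-8 and, only when they are not valid, round-trip them through UTF-16LE so that .encode really has to transcode and can apply its :invalid and :replace options. A plausible reason the direct .encode call seemed to have no effect is that converting a string to the encoding it is already tagged with may be treated as a no-op.

```ruby
# Hypothetical helper (name and replacement character are assumptions).
# Takes the raw bytes of an HTTP response body and returns all href values.
def extract_hrefs(raw_html)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = raw_html.dup.force_encoding(Encoding::UTF_8)

  unless text.valid_encoding?
    # Round-trip through UTF-16LE: this forces a real conversion, so
    # malformed byte sequences are replaced instead of being left in place.
    text = text
      .encode(Encoding::UTF_16LE, invalid: :replace, undef: :replace, replace: "?")
      .encode(Encoding::UTF_8)
  end

  # Same regex as in the question; flatten turns [["a"], ["b"]] into ["a", "b"].
  text.scan(/href="(.*?)"/i).flatten
end

# Example input: "\xE9" is a lone Latin-1 byte, which is not valid UTF-8.
body = "<a href=\"http://example.com/a\">caf\xE9</a> <A HREF=\"/b\">x</A>".b
links = extract_hrefs(body)
```

Since the regex itself contains only ASCII characters, another option under the same assumptions would be to leave the body tagged as binary and scan it directly, at the cost of getting binary-tagged strings back.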

© Stack Overflow or respective owner
