Python line file iteration and strange characters

Posted by muckabout on Stack Overflow
Published on 2010-04-29T13:57:43Z

I have a huge gzipped text file that I need to read line by line. I use the following:

import codecs
import gzip
for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
  print i, line

At some point late in the file, the Python output diverges from the file's contents. Lines are being split at odd special characters that Python apparently treats as newlines. When I open the file in 'vim', the lines are intact, but the suspect characters are rendered strangely. Is there something I can do to fix this?
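One possible cause, offered here as a sketch and not as a confirmed diagnosis of this file: the stream readers returned by `codecs.getreader` split decoded text with `str.splitlines`, which treats several characters other than `\n` (for example U+0085, U+2028, and the control characters `\x1c` to `\x1e`) as line boundaries. The example below is written for Python 3 (an assumption, since the snippet above uses Python 2 syntax), uses made-up in-memory data instead of `file.gz`, and the helper names are hypothetical.

```python
import codecs
import gzip
import io


def lines_via_codecs_reader(gz_bytes):
    """Iterate a gzip stream through a codecs StreamReader, as in the question."""
    binary = gzip.GzipFile(fileobj=io.BytesIO(gz_bytes))
    reader = codecs.getreader("utf-8")(binary)
    return list(reader)


def lines_via_text_wrapper(gz_bytes):
    """Iterate the same stream, but break lines on '\\n' only."""
    binary = gzip.GzipFile(fileobj=io.BytesIO(gz_bytes))
    # newline="\n" tells TextIOWrapper to end lines only at "\n".
    text = io.TextIOWrapper(binary, encoding="utf-8", newline="\n")
    return list(text)


# Made-up sample: U+0085 (NEXT LINE) sits in the middle of the first line.
sample = gzip.compress("first\u0085still first\nsecond\n".encode("utf-8"))

print(len(lines_via_codecs_reader(sample)))  # 3: the first line is split at U+0085
print(len(lines_via_text_wrapper(sample)))   # 2: only "\n" ends a line
```

If this is what is happening, opening the real file with `gzip.open('file.gz', 'rt', encoding='utf-8', newline='\n')` in Python 3 should keep such lines whole.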

I've tried other codecs, including utf-16 and latin-1. I've also tried using no codec at all.

I looked at the file using 'od'. Sure enough, there are \n characters where they shouldn't be, but the "wrong" ones are preceded by an odd character. I suspect an encoding in which some characters take two bytes, and the trailing byte looks like a \n when the bytes are not decoded as a pair.
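The 'od' check can be approximated in Python to see which bytes sit directly before each `\n` byte. This is a minimal sketch with a hypothetical helper, `newline_contexts`, and made-up sample data, not the contents of the real file. It illustrates the two-byte hypothesis with UTF-16-BE, where U+010A encodes as `01 0a`. In valid UTF-8 this cannot happen, because every byte of a multi-byte sequence is 0x80 or higher.

```python
def newline_contexts(raw, width=4):
    """Return (offset, hex of the preceding bytes) for every 0x0A byte in raw."""
    contexts = []
    offset = raw.find(b"\n")
    while offset != -1:
        start = max(0, offset - width)
        contexts.append((offset, raw[start:offset].hex()))
        offset = raw.find(b"\n", offset + 1)
    return contexts


# Made-up sample: a real newline followed by U+010A, encoded as UTF-16-BE.
sample = "ab\n\u010a".encode("utf-16-be")

# The real newline is preceded by 00; the false one is the low byte of U+010A.
print(newline_contexts(sample, width=1))  # [(5, '00'), (7, '01')]
```

Running the helper over a decompressed chunk of the real file would show whether the unexpected breaks share a common preceding byte.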

If I replace:

gzip.open('file.gz')

with:

os.popen('zcat file.gz')

it works fine (and is actually noticeably faster). But I'd like to know where I'm going wrong.
