Set a script to automatically detect character encoding in a plain-text-file in Python?

Posted by Haidon on Stack Overflow See other posts from Stack Overflow or by Haidon
Published on 2010-06-14T02:27:15Z Indexed on 2010/06/14 2:32 UTC
Read the original article Hit count: 339

Filed under:

python

|

character-encoding

|

find-and-replace

I've set up a script that basically does a large-scale find-and-replace on a plain text document.

At the moment it works fine with ASCII, UTF-8, and UTF-16 (and possibly others, but I've only tested these three) encoded documents so long as the encoding is specified inside the script (the example code below specifies UTF-16).

Is there a way to make the script automatically detect which of these character encodings is being used in the input file and automatically set the character encoding of the output file the same as the encoding used on the input file?

findreplace = [
('term1', 'term2'),
]    

inF = open(infile,'rb')
    s=unicode(inF.read(),'utf-16')
    inF.close()

    for couple in findreplace:
        outtext=s.replace(couple[0],couple[1])
        s=outtext

    outF = open(outFile,'wb')
    outF.write(outtext.encode('utf-16'))
    outF.close()

Thanks!

© Stack Overflow or respective owner

Related posts about python

unmet dependencies in Ubuntu 12.04

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I tried today to install a dvb-card on my Ubuntu 12.04 (Linux blauhai-linux 3.2.0-25-generic #40-Ubuntu SMP Wed May 23 20:30:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux ). The installation failed with an error. After that, i tried to install python (it was already installed but i got this error): linux:~$… >>> More
How can I get sikuli-ide to work?

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I installed sikuli-ide with sudo apt-get install sikuli-ide Everything was fine until I tried to start it from the terminal. I typed sikuli-ide But the only response I got was [info] locale: en_US The application was not started, furthermore there is no desktop file and sikuli-ide does not… >>> More
Getting PATH right for python after MacPorts install

as seen on Super User - Search for 'Super User'
I can't import some python libraries (PIL, psycopg2) that I just installed with MacPorts. I looked through these forums, and tried to adjust my PATH variable in $HOME/.bash_profile in order to fix this but it did not work. I added the location of PIL and psycopg2 to PATH. I know that Terminal is… >>> More
call python with system() in R to run a python script emulating the python console

as seen on Stack Overflow - Search for 'Stack Overflow'
I want to pass a chunk of Python code to Python in R with something like system('python ...'), and I'm wondering if there is an easy way to emulate the python console in this case. For example, suppose the code is "print 'hello world'", how can I get the output like this in R? >>> print… >>> More
Python - Calling a non python program from python?

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I am currently struggling to call a non python program from a python script. I have a ~1000 files that when passed through this C++ program will generate ~1000 outputs. Each output file must have a distinct name. The command I wish to run is of the form: program_name -input -output -o1 -o2… >>> More

Related posts about character-encoding

Harmonizing Character Encoding Between Imported Data and MySQL

as seen on Internet.com - Search for 'Internet.com'
MySQL's Latin-1 default encoding combined with MySQL 4.1.12's (or greater) UTF8 encoding allows the maximum number of characters codes, however incoming data with different character encoding can still present problems. Rob Gravelle shows you how to avoid problems before a lot of work is required… >>> More
Harmonizing Character Encoding Between Imported Data and MySQL

as seen on Internet.com - Search for 'Internet.com'
MySQL's Latin-1 default encoding combined with MySQL 4.1.12's (or greater) UTF8 encoding allows the maximum number of characters codes, however incoming data with different character encoding can still present problems. Rob Gravelle shows you how to avoid problems before a lot of work is required… >>> More
How to cross-reference many character encodings with ASCII OR UTFx?

as seen on Programmers - Search for 'Programmers'
I'm working with a binary structure, the goal of which is to index the significance of specific bits for any character encoding so that we may trigger events while doing specific checks against the profile. Each character encoding scheme has an associated system record. This record's leading value… >>> More
Determining default character set of platform in Java

as seen on Stack Overflow - Search for 'Stack Overflow'
I am programming in Java I have the code as: byte[] b = test.getBytes(); In the api it is specified that if we do not specify character encoding it takes the default platform character encoding. What is meant by "default platform character encoding" ? Does it mean the Java encoding or the OS… >>> More
Perl character encoding

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi People, I have an environment variable set in Windows as TEST=abc£ which uses Windows-1252 code page. Now when I run a perl program - 'test.pl', this environment value comes properly. When I call another perl code - 'test2.pl' from 'test1.pl' either by system(..) or Win32::process(..), the environment… >>> More