Python: Removing particular character (u"\u2610") from string

Posted by duhaime on Stack Overflow
Published on 2013-10-22T21:37:48Z Indexed on 2013/10/22 21:53 UTC
I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over XML text files (sample) that are apparently encoded in UTF-8, using Beautiful Soup to parse each file, and then checking whether any sentence in the file contains one or more words from two different lists of words. Because the XML files are from the eighteenth century, I need to retain the em dashes that are in the XML. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is U+2610 (BALLOT BOX), the character named in the title.

(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
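
Because the glyph renders differently in each tool, it may help to inspect the code point itself rather than its appearance. The following is a minimal sketch using the standard-library `unicodedata` module; the helper name `describe_non_ascii` and the sample string are made up for illustration and are not taken from the sample file.

```python
# -*- coding: utf-8 -*-
import unicodedata


def describe_non_ascii(text):
    """Return (code point, Unicode name) for each distinct non-ASCII character.

    `text` must already be decoded text (unicode in Python 2, str in Python 3).
    """
    seen = []
    for ch in text:
        if ord(ch) > 127:
            entry = ("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
            if entry not in seen:
                seen.append(entry)
    return seen


# Hypothetical sample line: an em dash followed by the suspected box character.
sample = u"the house\u2014and garden \u2610 were sold"
for code_point, name in describe_non_ascii(sample):
    print("%s  %s" % (code_point, name))
```

If the character in line 3682 turns out to be something other than U+2610, for example U+00A6 (BROKEN BAR), any later replacement call would need that code point instead.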

To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.

for work in glob.glob(pathtofiles):

    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)

    decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
    soup = BeautifulSoup(decodefile)

    textwithtags = soup.findAll('text')

    textwithtagsasstring = str(textwithtags)

    #this method strips everything between anglebrackets as it should
    textwithouttags = stripTags(textwithtagsasstring)

    #clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +',' ', nonewlines)

    print noextrawhitespace #the boxes appear

I tried to remove the boxes by using:

noboxes = noextrawhitespace.replace(u"\u2610", "")

But Python raised an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)

Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
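
One plausible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: in Python 2, `str(textwithtags)` yields a UTF-8 byte string, so calling `.replace()` on it with a `unicode` argument makes Python implicitly decode the bytes as ASCII. That fails on byte 0xe2, which is the first byte of the UTF-8 encoding of both the em dash (U+2014) and U+2610. A minimal sketch of the usual remedy, decoding once and then working only with text, follows; `clean_text` and the sample input are hypothetical and stand in for the clean-up steps only, not for the Beautiful Soup parsing or `stripTags`.

```python
# -*- coding: utf-8 -*-
import re

BALLOT_BOX = u"\u2610"  # the character named in the question title


def clean_text(raw_bytes, encoding="utf-8"):
    """Decode bytes to text, drop U+2610, and collapse whitespace.

    Hypothetical helper: it mirrors the clean-up steps of the loop,
    not the parsing or tag stripping.
    """
    text = raw_bytes.decode(encoding)      # bytes -> text, exactly once
    text = text.replace(BALLOT_BOX, u"")   # both operands are text now
    text = text.replace(u"\n", u" ")       # newlines become spaces
    return re.sub(u" +", u" ", text)       # collapse runs of spaces


# Made-up input standing in for the tag-stripped contents of one file.
raw = u"He left\u2014at last \u2610\nfor  London".encode("utf-8")
cleaned = clean_text(raw)
```

Under this approach the em dash survives because only U+2610 is replaced. In Python 2 the text would be encoded again only at output time, for example with `cleaned.encode('utf-8')`.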

Filed under: python, xml, string, unicode, ascii
