Search Results

Search found 2 results on 1 pages for 'duhaime'.

Page 1/1 | 1 

  • Python: Removing particular character (u"\u2610") from string

    - by duhaime
    I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different list of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character. (You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.) To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove. for work in glob.glob(pathtofiles): openfile = open(work) readfile = openfile.read() stringfile = str(readfile) decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line? soup = BeautifulSoup(decodefile) textwithtags = soup.findAll('text') textwithtagsasstring = str(textwithtags) #this method strips everything between anglebrackets as it should textwithouttags = stripTags(textwithtagsasstring) #clean text nonewlines = textwithouttags.replace("\n", " ") noextrawhitespace = re.sub(' +',' ', nonewlines) print noextrawhitespace #the boxes appear I tried to remove the boxes by using noboxes = noextrawhitespace.replace(u"\u2610", "") But Python threw an error flag: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128) Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.

    Read the article

  • Python: Determine whether list of lists contains a defined sequence

    - by duhaime
    I have a list of sublists, and I want to see if any of the integer values from the first sublist plus one are contained in the second sublist. For all such values, I want to see if that value plus one is contained in the third sublist, and so on, proceeding in this fashion across all sublists. If there is a way of proceeding in this fashion from the first sublist to the last sublist, I wish to return True; otherwise I wish to return False. In other words, for each value in sublist one, for each "step" in a "walk" across all sublists read left to right, if that value + n (where n = number of steps taken) is contained in the current sublist, the function should return True; otherwise it should return False. (Sorry for the clumsy phrasing--I'm not sure how to clean up my language without using many more words.) Here's what I wrote. a = [ [1,3],[2,4],[3,5],[6],[7] ] def find_list_traversing_walk(l): for i in l[0]: index_position = 0 first_pass = 1 walking_current_path = 1 while walking_current_path == 1: if first_pass == 1: first_pass = 0 walking_value = i if walking_value+1 in l[index_position + 1]: index_position += 1 walking_value += 1 if index_position+1 == len(l): print "There is a walk across the sublists for initial value ", walking_value - index_position return True else: walking_current_path = 0 return False print find_list_traversing_walk(a) My question is: Have I overlooked something simple here, or will this function return True for all true positives and False for all true negatives? Are there easier ways to accomplish the intended task? I would be grateful for any feedback others can offer!

    Read the article

1