Search for a string allowing for one mismatch in any location of the string, Python

Posted by Vincent on Stack Overflow
Published on 2010-03-10T20:42:30Z Indexed on 2010/03/12 1:17 UTC

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 of them and need to look for each sequence in the entire genome (the Toxoplasma gondii parasite). I am not sure how large the genome is, but it is much larger than my 230,000 sequences.

I need to look for each of my 25-character sequences, for example (AGCCTCCCATGATTGAACAGATCAT). The genome is formatted as one continuous string, i.e. (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT.........)

I don't care where or how many times it is found, just yes or no. This is simple, I think: str.find(AGCCTCCCATGATTGAACAGATCAT)
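For the yes/no exact case, a minimal sketch might look like the following. The function name `contains_exact` and the toy genome are made up for illustration; the real genome string would be loaded from a file.

```python
def contains_exact(genome: str, probe: str) -> bool:
    """Return True if probe occurs anywhere in genome (position is ignored)."""
    # The `in` operator does a substring search, like str.find(...) != -1.
    return probe in genome


# Toy data (an assumption, not real genome data): the probe from the post
# embedded between two short flanks.
probe = "AGCCTCCCATGATTGAACAGATCAT"
toy_genome = "TTTT" + probe + "GGGG"

print(contains_exact(toy_genome, probe))
```

Since only a yes/no answer is needed, `in` reads a little more directly than comparing the result of `str.find` with -1.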

But I also want to find a close match, defined as wrong (mismatched) at any location but at only 1 location, and record that location in the sequence. I am not sure how to do this. The only thing I can think of is using a wildcard and performing the search with the wildcard in each position, i.e. searching 25 times. For example:

AGCCTCCCATGATTGAACAGATCAT
AGCCTCCCATGATAGAACAGATCAT

is a close match with a mismatch at position 13.
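The wildcard idea described above could be sketched with the standard `re` module roughly as follows. This is only one possible reading of the idea: the function name `find_one_mismatch` is hypothetical, and instead of a plain `.` wildcard it uses a negated character class so that an exact match is not reported as a close match. Positions are counted from 0, which agrees with "position 13" in the example.

```python
import re


def find_one_mismatch(genome: str, probe: str):
    """Return (genome_offset, mismatch_index) for the first window that
    differs from probe at exactly one position, or None if there is none."""
    for i in range(len(probe)):
        # Everything must match exactly except position i, which must be
        # any character OTHER than the expected one.
        pattern = (
            re.escape(probe[:i])
            + "[^" + re.escape(probe[i]) + "]"
            + re.escape(probe[i + 1:])
        )
        match = re.search(pattern, genome)
        if match:
            return match.start(), i
    return None


probe = "AGCCTCCCATGATTGAACAGATCAT"
variant = "AGCCTCCCATGATAGAACAGATCAT"  # the close match from the example
toy_genome = "TTTT" + variant + "GGGG"  # toy data, not a real genome

print(find_one_mismatch(toy_genome, probe))
```

This runs one full scan of the genome per probe position, so 25 scans per probe in the worst case. That is simple but not fast for 230,000 probes.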

Speed is not a big issue, since I am only doing it 3 times, I hope, but it would be nice if it was fast.
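If speed does matter, one common trick (not something from the post, just a sketch) relies on the fact that a window with at most one mismatch must match at least one half of the probe exactly. So each half can be located with a fast exact search and only those candidate windows need to be checked character by character. The names `mismatch_index` and `find_with_seeds` are made up for this illustration.

```python
def mismatch_index(window: str, probe: str):
    """Return -1 for an exact match, the 0-based index of the single
    mismatch, or None if there is more than one mismatch."""
    found = -1
    for i, (a, b) in enumerate(zip(window, probe)):
        if a != b:
            if found != -1:
                return None  # second mismatch: not a close match
            found = i
    return found


def find_with_seeds(genome: str, probe: str):
    """Yield (offset, mismatch_index) for every window of the genome that
    matches probe with at most one mismatch (-1 means exact match)."""
    n = len(probe)
    half = n // 2
    # (offset of the seed inside the probe, seed text)
    seeds = [(0, probe[:half]), (half, probe[half:])]
    seen = set()
    for seed_offset, seed in seeds:
        pos = genome.find(seed)
        while pos != -1:
            start = pos - seed_offset
            if 0 <= start <= len(genome) - n and start not in seen:
                seen.add(start)
                mm = mismatch_index(genome[start:start + n], probe)
                if mm is not None:
                    yield start, mm
            pos = genome.find(seed, pos + 1)  # allow overlapping hits


probe = "AGCCTCCCATGATTGAACAGATCAT"
variant = "AGCCTCCCATGATAGAACAGATCAT"
toy_genome = "TTTT" + variant + "GGGG" + probe + "CC"  # toy data only

print(sorted(find_with_seeds(toy_genome, probe)))
```

Because only a yes/no answer plus the mismatch location is needed, the generator can be stopped at the first hit, for example with `next(find_with_seeds(genome, probe), None)`.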

There are programs that find matches and partial matches, but I am looking for a type of partial match that is not available with these applications.

Here is a similar post for Perl, but they are only comparing sequences, not searching a continuous string:

Related post

© Stack Overflow or respective owner
