String manipulation appears to be inefficient

Posted by user2964780 on Stack Overflow See other posts from Stack Overflow or by user2964780
Published on 2013-11-07T12:34:18Z Indexed on 2013/11/08 3:54 UTC
Read the original article Hit count: 75

Filed under:

I think my code is too inefficient. I'm guessing it has something to do with using strings, though I'm unsure. Here is the code:

genome = FASTAdata[1]
genomeLength = len(genome);

# Hash table holding all the k-mers we will come across
kmers = dict()

# We go through all the possible k-mers by index
for outer in range (0, genomeLength-1):
    for inner in range (outer+2, outer+22):
        substring = genome[outer:inner]
        if substring in kmers:              # if we already have this substring on record, increase its value (count of num of appearances) by 1
            kmers[substring] += 1
        else:
            kmers[substring] = 1            # otherwise record that it's here once

This is to search through all substrings of length at most 20. Now this code seems to take pretty forever and never terminate, so something has to be wrong here. Is using [:] on strings causing the huge overhead? And if so, what can I replace it with?

And for clarity the file in question is nearly 200mb, so pretty big.

© Stack Overflow or respective owner

Related posts about python