biopython - Developer IT

BioPython: extracting sequence IDs from a Blast output file

- by Jon

Hi, I have a BLAST output file in XML format. It is 22 query sequences with 50 hits reported from each sequence. And I want to extract all the 50x22 hits. This is the code I currently have, but it only extracts the 50 hits from the first query. from Bio.Blast import NCBIXM blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() save_file = open("/Users/jonbra/Desktop/my_fasta_seq.fasta", 'w') for alignment in blast_record.alignments: for hsp in alignment.hsps: save_file.write('>%s\n' % (alignment.title,)) save_file.close() Somebody have any suggestions as to extract all the hits? I guess I have to use something else than alignments. Hope this was clear. Thanks! Jon

Read the article

Parse large XML file w/ script or use BioPython API ?

- by jeremy04

Hey guys this is my first question on here. I'm trying to make a local copy of the UniprotKB in SQL. The UniprotKB is 2.1GB, and it comes in XML and a special text format used by SwissProt Here are my options: 1) Use a SAX parser (XML) - I chose Ruby, and Nokogiri. I started writing the parser, but my initial reaction: how would I map the XML schema to the SAX parser? 2) BioPython - I already have BioSQL/Biopython installed, which literally created my SQL schema for me, and I was able to successfully insert one SwissProt/Uniprot txt file into the database. I'm running it right now (crosses fingers) on the entire 2.1gb. Here is the code I'm running: from Bio import SeqIO from BioSQL import BioSeqDatabase from Bio import SwissProt server = BioSeqDatabase.open_database(driver = "MySQLdb", user = "root", passwd = "", host="localhost", db = "bioseqdb") db = server["uniprot"] iterator = SeqIO.parse(open("/path/to/uniprot_sprot.dat", "r"), "swiss") db.load(iterator) server.commit() Edit: it's now crashing because the transactions are getting locked (since the tables are Innodb) Error Number: 1205 Lock wait timeout exceeded; try restarting transaction. I'm using MySQL version: 5.1.43 Should I switch my database to Postgrelsql ?

Read the article

Subprocess fails to catch the standard output

- by user343934

I am trying to generate tree with fasta file input and Alignment with MuscleCommandline import sys,os, subprocess from Bio import AlignIO from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="c:\Python26\opuntia.fasta") child= subprocess.Popen(str(cline), stdout = subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform!="win32")) align=AlignIO.read(child.stdout,"fasta") outfile=open('c:\Python26\opuntia.phy','w') AlignIO.write([align],outfile,'phylip') outfile.close() I always encounter with these problems Traceback (most recent call last): File "<string>", line 244, in run_nodebug File "C:\Python26\muscleIO.py", line 11, in <module> align=AlignIO.read(child.stdout,"fasta") File "C:\Python26\Lib\site-packages\Bio\AlignIO\__init__.py", line 423, in read raise ValueError("No records found in handle") ValueError: No records found in handle

Read the article

Can anyone tell me why these lines are not working?

- by user343934

I am trying to generate tree with fasta file input and Alignment with MuscleCommandline import sys,os, subprocess from Bio import AlignIO from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="c:\Python26\opuntia.fasta") child= subprocess.Popen(str(cline), stdout = subprocess.PIPE, stderr=subprocess.PIPE, shell=(sys.platform!="win32")) align=AlignIO.read(child.stdout,"fasta") outfile=open('c:\Python26\opuntia.phy','w') AlignIO.write([align],outfile,'phylip') outfile.close() I always encounter with these problems Traceback (most recent call last): File "", line 244, in run_nodebug File "C:\Python26\muscleIO.py", line 11, in align=AlignIO.read(child.stdout,"fasta") File "C:\Python26\Lib\site-packages\Bio\AlignIO_init_.py", line 423, in read raise ValueError("No records found in handle") ValueError: No records found in handle

Read the article

How to incorporate existing open source software from a licensing perspective?

- by Matt

I'm working on software that uses the following libraries: Biopython SciPy NumPy All of the above have licenses similar to MIT or BSD. Three scenarios: First, if I don't redistribute those dependencies, and only my code, then all I need is my own copyright and license (planing on using the MIT License) for my code. Correct? What if I use py2exe or py2app to create a binary executable to distribute so as to make it easy for people to run the application without needing to install python and all the dependencies. Of course this also means that my binary file(s) contains python itself (along with any other packages I might have performed a pip install xyz). What if I bundle Biopython, SciPy, and NumPy binaries in my package? In the latter two cases, what do I need to do to comply with copyright laws.

Read the article

50 sequences in one line

- by user343934

I have Muttiple sequence alignment (clustal) file and i want to read this file and arrange sequences in a such a way that in looks more clear and precise in order. I am doing this from biopython using AlignIO object. My codes like this alignment = AlignIO.read("opuntia.aln", "clustal") print "Number of rows: %i" % len(align) for record in alignment: print "%s - %s" % (record.id, record.seq) My Output-- http://i48.tinypic.com/ae48ew.jpg , it looks messy and long scrolling. What i want to do is print only 50 sequences in each line and continue till the end of alignment file. I wish to have output like this---http://i45.tinypic.com/4vh5rc.jpg from --http://www.ebi.ac.uk/Tools/clustalw2/, sorry two links are just a text due to my reputation. Any suggestions, algorithm and sample code is appreciated Thanks in advance Br,

Read the article

File management

- by user343934

I am working on python and biopython right now. I have a file upload form and whatever file is uploaded suppose(abc.fasta) then i want to pass same name in execute (abc.fasta) function parameter and display function parameter (abc.aln). Right now i am changing file name manually, but i want to have it automatically. Workflow goes like this. ----If submit is not true then display only header and form part --- if submit is true then call execute() and get file name from form input --- Then display the save file result in the same page. File name is same as input. My raw code is here -- http://pastebin.com/FPUgZSSe Any suggestions, changes and algorithm is appreciated Thanks

Read the article

saving file name from html form input

- by user343934

I am working on python and biopython right now. I have a file upload form and whatever file is uploaded suppose(abc.fasta) then i want to pass same name in execute (abc.fasta) function parameter and display function parameter (abc.aln). Workflow goes like this. ----If submit is not true then display only header and form part --- if submit is true then call execute() and get file name from form input --- Then display the save file result in the same page. File name is same as input. My raw code is here -- http://pastebin.com/FPUgZSSe Any suggestions, changes and algorithm is appreciated Thanks

Read the article

Efficient file buffering & scanning methods for large files in python

- by eblume

The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it: What is the fastest (least execution time) way to split a text file in to ALL (overlapping) substrings of size N (bound N, eg 36) while throwing out newline characters. I am writing a module which parses files in the FASTA ascii-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like. As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module. Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do. My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) in to every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8 import cStringIO example_file = cStringIO.StringIO("""\ header CAGTcag TFgcACF """) for read in parse(example_file): ... print read ... CAGTCAGTF AGTCAGTFG GTCAGTFGC TCAGTFGCA CAGTFGCAC AGTFGCACF The function that I found had the absolute best performance from the methods I could think of is this: def parse(file): size = 8 # of course in my code this is a function argument file.readline() # skip past the header buffer = '' for line in file: buffer += line.rstrip().upper() while len(buffer) = size: yield buffer[:size] buffer = buffer[1:] This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community. Thanks! Note, this time includes a lot of extra calculation, such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size. Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!

Developer IT