system io - Page 122 - Developer IT

What are cons on reading a file which already used for streaming to writing?

- by rahul

Wondering if one file is open for writing in and another programe is accessing it recursively in tcl.

Efficient file buffering & scanning methods for large files in python

The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it: What is the fastest (least execution time) way to split a text file in to ALL (overlapping) substrings of size N (bound N, eg 36) while throwing out newline characters. I am writing a module which parses files in the FASTA ascii-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like. As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module. Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do. My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) in to every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8 import cStringIO example_file = cStringIO.StringIO("""\ header CAGTcag TFgcACF """) for read in parse(example_file): ... print read ... CAGTCAGTF AGTCAGTFG GTCAGTFGC TCAGTFGCA CAGTFGCAC AGTFGCACF The function that I found had the absolute best performance from the methods I could think of is this: def parse(file): size = 8 # of course in my code this is a function argument file.readline() # skip past the header buffer = '' for line in file: buffer += line.rstrip().upper() while len(buffer) = size: yield buffer[:size] buffer = buffer[1:] This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community. Thanks! Note, this time includes a lot of extra calculation, such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size. Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!

Search Results

Search found 47692 results on 1908 pages for 'system io'.

Page 122/1908 | < Previous Page | 118 119 120 121 122 123 124 125 126 127 128 129 | Next Page >

- by rahul

- by eblume

- by sergio

- by rearden

- by Marquinio

- by eyecreate

- by Tom Brito

- by Joe

- by willsan

- by Anonymous

- by Fred

- by Austin

- by Nathan

- by jumbojs

- by rwmnau

- by user1039583

- by sasha

- by Fabian Steeg

- by Mike

- by Sanoj

- by jclova

- by Sniperchild

- by Kevin Gebhardt

- by iOS-Newbie

- by S. Singh

< Previous Page | 118 119 120 121 122 123 124 125 126 127 128 129 | Next Page >