Is there a way to efficiently yield every file in a directory containing millions of files?

Posted by Josh Smeaton on Stack Overflow on 2011-02-23.

I'm aware of os.listdir, but as far as I can gather, it reads all of the filenames in a directory into memory and then returns the list. What I want is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
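Note for readers on newer Python: os.scandir, added in Python 3.5 (well after this question was asked), does exactly this. It returns an iterator rather than a list, so directory entries stream from the OS a batch at a time instead of being held in memory all at once. A minimal sketch; the path and the process() handler are placeholders:

    import os

    def filenames(path):
        # os.scandir streams entries instead of building a full list
        for entry in os.scandir(path):
            if entry.is_file():
                yield entry.name

    for name in filenames("/some/huge/directory"):
        process(name)  # hypothetical per-file handler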

Is there any way to do this? I worry about the case where filenames change, new files are added, and files are deleted while such a method is in use. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the collection's state at the beginning and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error when filesystem changes (adding, removing, or renaming files within the iterated directory) modify the collection?
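For the built-in tools, the answer appears to be that no error is raised either way. os.listdir returns a snapshot list, so renames during iteration never invalidate it; they just leave stale names behind. And the os.scandir documentation explicitly leaves it unspecified whether a file added or removed after the iterator is created appears in the results. A small sketch of tolerating stale snapshot entries, with a placeholder path:

    import os

    path = "/some/huge/directory"
    for name in os.listdir(path):       # snapshot is taken once, up front
        full = os.path.join(path, name)
        if not os.path.exists(full):    # renamed or deleted since the snapshot
            continue
        ...                             # work on the file here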

There are potentially a few cases that could cause the iterator to fail, and it all depends on how the iterator maintains its state. Using S.Lott's example:

filea.txt
fileb.txt
filec.txt

The iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, suppose it uses the filename filea.txt to find its current position and thereby the next file: filea.txt is no longer there, so what happens? It may not be able to recover its position in the collection. Similarly, if the iterator were to fetch fileb.txt while yielding filea.txt, it could look up the position of fileb.txt, fail, and raise an error.

If the iterator instead maintained a positional index, say dir.get_file(0), positional state would not be affected, but some files could be missed, since a rename or deletion can shift a file to an index 'behind' the iterator.

This is all theoretical, of course, since at the time of writing there appears to be no built-in (Python) way of lazily iterating over the files in a directory. There are some great answers below, however, that solve the problem using queues and notifications.
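For completeness, one workaround available at the time is to call libc's opendir/readdir directly through ctypes, which hands back one entry per call. This is a minimal sketch assuming the 64-bit Linux/glibc dirent layout; the field sizes differ on other platforms, so treat it as illustrative rather than portable:

    import ctypes
    import ctypes.util

    class Dirent(ctypes.Structure):
        # Layout assumes 64-bit Linux/glibc; not portable.
        _fields_ = [
            ("d_ino", ctypes.c_ulong),      # inode number
            ("d_off", ctypes.c_long),       # offset to the next dirent
            ("d_reclen", ctypes.c_ushort),  # length of this record
            ("d_type", ctypes.c_ubyte),     # file type
            ("d_name", ctypes.c_char * 256),
        ]

    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.opendir.restype = ctypes.c_void_p
    libc.opendir.argtypes = [ctypes.c_char_p]
    libc.readdir.restype = ctypes.POINTER(Dirent)
    libc.readdir.argtypes = [ctypes.c_void_p]
    libc.closedir.argtypes = [ctypes.c_void_p]

    def iter_dir(path):
        """Yield filenames one at a time, without building a list."""
        handle = libc.opendir(path.encode())
        if not handle:
            raise OSError("opendir failed for %r" % path)
        try:
            while True:
                entry = libc.readdir(handle)
                if not entry:  # NULL pointer: end of directory
                    break
                name = entry.contents.d_name.decode()
                if name not in (".", ".."):
                    yield name
        finally:
            libc.closedir(handle)

Note that POSIX leaves it unspecified whether readdir returns an entry for a file added or removed after opendir, which is exactly the concurrent-modification hazard discussed above.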

Edit:

The OS of concern is Red Hat Linux. My use case is this:

Process A is continuously writing files to a storage location. Process B (the one I'm writing) will iterate over these files, do some processing based on the filename, and move the files to another location.
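A minimal sketch of what Process B might look like; the paths and the process() function are hypothetical placeholders. Moving each file out as it is handled drains the source directory, so every pass only sees files that still need work:

    import os
    import shutil
    import time

    SRC = "/data/incoming"   # written to by Process A (assumed path)
    DST = "/data/processed"  # destination (assumed path)

    def process(name):
        ...  # filename-based processing goes here

    def run():
        while True:
            moved_any = False
            for entry in os.scandir(SRC):
                if entry.is_file():
                    process(entry.name)
                    shutil.move(entry.path, os.path.join(DST, entry.name))
                    moved_any = True
            if not moved_any:
                time.sleep(1)  # idle: wait for Process A to produce more

One caveat: if Process A writes files in place, B can pick up a half-written file. The usual fix is for A to write to a temporary name and rename to the final name once complete, since rename is atomic within a filesystem.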

Edit:

Definition of valid:

Adjective 1. Well grounded or justifiable, pertinent.

(Sorry S.Lott, I couldn't resist).

I've edited the paragraph in question above.
