Introduction
I need to create an iterator over a filesystem subtree, i.e. an iterator that, given a folder, returns all the files contained inside it in depth-first order, one on every call to the next method.
The contents of the subtree can change over time: it is possible (and probable) that, while the iteration is still in progress, new subfolders and files will be created and some of the existing ones will be deleted.
Fortunately, the following conditions are acceptable:
- the implementation can (though it would be better if it did not) skip newly created files (i.e. files created after the iteration started) and folders (along with the files in those folders), or even just some of them,
- the implementation can (though it would be better if it did not) list deleted files (i.e. files that no longer exist but were present when the iteration started), or even just some of them.
Motivation
To give you better insight into the rationale behind those decisions, I’d like to briefly describe the application as a whole.
It is a producer/consumer-like application. A web service (the producer) would accept files and store them on a local filesystem, somewhere in the subtree hierarchy.
Another application (the consumer) would process these files. It would be invoked periodically via CRON every few minutes. When launched, it would crawl the subtree, find all documents, and hand them over to be processed (to yet another application, if that’s relevant). After a document is processed, it gets deleted from the local filesystem.
The problem is that the producer and the consumer would be running at the same time. Moreover, multiple instances of the consumer application might be running at the same time as well. That is, while a consumer is crawling the subtree, new documents might get created and existing documents might get deleted. Even the structure of subdirectories might get modified.
Because the crawler gets launched periodically every few minutes, it does not matter whether a single run consumes all the documents available at the time (especially those produced while the consumer is running). It only matters that every produced document eventually gets consumed (with a reasonably small delay). That’s where the relaxed conditions listed above come from.
Possible solutions
I first thought I would take an in-memory snapshot of the subtree at launch time (i.e. the list of documents to be processed) and then iterate over it. See my other post. But the hierarchy might be very large (even tens of thousands of documents processed every few hours), and I was thinking that this approach might have unacceptable performance demands (memory and speed).
How would you implement such an iterator?
Thanks very much for your help, and sorry for the length of the post.
Since you cannot use JDK 7 directly, you can still look at how it is done there: FileTreeWalker
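
If reading the JDK source is not an option, here is a minimal sketch of a lazy depth-first iterator that sticks to java.io.File, so it should not need JDK 7. It is not the JDK's FileTreeWalker code, only one possible way to avoid the up-front snapshot: a directory is listed at the moment it is entered, so at most one listing per level of depth is held in memory. The class name LazyFileIterator is made up for illustration. It relies on the relaxed conditions from the question: entries deleted after a directory was listed are skipped where that can be detected, and entries created after it was listed are simply missed until the next run. It does not guard against symbolic link cycles.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper: walks a subtree depth-first and yields regular files lazily. */
class LazyFileIterator implements Iterator<File> {
    // One pending listing per directory level currently being visited (top = deepest).
    private final Deque<Iterator<File>> stack = new ArrayDeque<Iterator<File>>();
    // Look-ahead element; null means "not computed yet" or "exhausted".
    private File next;

    LazyFileIterator(File root) {
        push(root);
    }

    // Lists a directory only when the walk actually enters it.
    private void push(File dir) {
        File[] children = dir.listFiles(); // null if dir vanished, is unreadable, or is not a directory
        if (children != null) {
            stack.push(Arrays.asList(children).iterator());
        }
    }

    // Moves forward until a regular file is found or the whole subtree is exhausted.
    private void advance() {
        while (next == null && !stack.isEmpty()) {
            Iterator<File> top = stack.peek();
            if (!top.hasNext()) {
                stack.pop(); // this directory is done, go back to its parent
                continue;
            }
            File candidate = top.next();
            if (candidate.isDirectory()) {
                push(candidate); // descend immediately: depth-first
            } else if (candidate.isFile()) {
                next = candidate;
            }
            // Otherwise the entry was deleted after it was listed, so skip it.
        }
    }

    public boolean hasNext() {
        advance();
        return next != null;
    }

    public File next() {
        advance();
        if (next == null) {
            throw new NoSuchElementException();
        }
        File result = next;
        next = null;
        return result;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

The consumer would loop with hasNext() and next() over new LazyFileIterator(rootFolder). A file can still disappear between next() returning it and the consumer opening it (for example when another consumer instance got there first), so the processing step has to tolerate a missing file either way.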