I’m planning to build an application which would crawl a part of a local filesystem (a subtree) in a depth-first-search manner and process all files it finds, except for some configurable exceptions.
To give an example, let’s say I have a directory structure like this:
> documents
    - generic-doc.txt
    > mails
        - mail-01.txt
        - mail-02.txt
        - mail-03.txt
        > unread
            - mail-04.txt
    > invoices
        > paid
            - invoice-01.pdf
            - invoice-02.pdf
        > unpaid
            - invoice-03.pdf
I also have an exclusion rule like this:
exclude = "documents/mails/unread | documents/invoices"
Given this input, my application would process the following documents:
- generic-doc.txt
- mail-01.txt
- mail-02.txt
- mail-03.txt
(i.e. it would process all files except those located in the documents/mails/unread and documents/invoices folders)
In the future, I might need to implement various forms of exclusion rules.
What is the best way to test the implementation of the crawling module (for example, that the module returns the correct set of documents when given an exclusion rule)? Can it be done without using a real filesystem?
Extract the exclusion logic into a separate module/class/object and test it in isolation. Then make sure that your crawler asks the ExclusionRule before processing a file.
A sketch
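One way this could look in Java, as a minimal sketch rather than a definitive implementation: the names ExclusionRule, PrefixExclusionRule and CrawlerSketch are hypothetical, and the "|"-separated syntax is assumed from the example in the question. Because Path objects are only compared and never opened, the rule can be exercised without a real filesystem.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction: decides whether a path should be skipped. */
interface ExclusionRule {
    boolean isExcluded(Path path);
}

/** Excludes every path located at or under one of the given directories. */
final class PrefixExclusionRule implements ExclusionRule {
    private final List<Path> excludedDirs = new ArrayList<>();

    /** Parses the "dir1 | dir2" syntax from the question (assumed format). */
    PrefixExclusionRule(String spec) {
        for (String part : spec.split("\\|")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                excludedDirs.add(Paths.get(trimmed).normalize());
            }
        }
    }

    @Override
    public boolean isExcluded(Path path) {
        Path normalized = path.normalize();
        for (Path dir : excludedDirs) {
            // Path.startsWith compares whole name elements, so
            // "documents/invoices-old" is not matched by "documents/invoices".
            if (normalized.startsWith(dir)) {
                return true;
            }
        }
        return false;
    }
}

/** Stand-in for the crawler's decision step: it asks the rule for each file found. */
final class CrawlerSketch {
    static List<Path> selectForProcessing(List<Path> found, ExclusionRule rule) {
        List<Path> result = new ArrayList<>();
        for (Path p : found) {
            if (!rule.isExcluded(p)) {
                result.add(p);
            }
        }
        return result;
    }
}
```

With this split, the rule gets its own unit tests on plain path strings, and the crawler can be tested separately with a stub ExclusionRule that excludes a fixed set of paths.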
Note that FileFilter already provides a similar service; maybe you can reuse that abstraction.
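If FileFilter here means java.io.FileFilter, reusing it might look roughly like the following. This is only a sketch under that assumption; ExcludedDirsFilter is a hypothetical name. A java.io.File is just a path wrapper, so accept() can be called on files that do not exist on disk.

```java
import java.io.File;
import java.io.FileFilter;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter: rejects anything at or under an excluded directory. */
final class ExcludedDirsFilter implements FileFilter {
    private final List<Path> excludedDirs = new ArrayList<>();

    ExcludedDirsFilter(String... dirs) {
        for (String dir : dirs) {
            excludedDirs.add(Paths.get(dir).normalize());
        }
    }

    @Override
    public boolean accept(File file) {
        // Only the path is inspected; the file itself is never opened.
        Path path = file.toPath().normalize();
        for (Path dir : excludedDirs) {
            if (path.startsWith(dir)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler could pass such a filter to File.listFiles(FileFilter) at each level of the recursion. Since the excluded directory itself is rejected, the crawler never descends into it.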