I need to process incoming XML files (another application will create them directly in a specific folder), and I need to do it fast.
There can be up to 200,000 files per day, and my current plan is to use .NET 4 and the TPL.
My current service concept is:
In one loop I want to check the folder for new files and put any I find onto a queue. A second loop will take files from the queue and create a new task (thread) for each of them. The number of simultaneous tasks should be configurable.
The first part is easy, but having two main loops with a queue between them is new to me.
And the question:
How do I create two loops (one that checks the folder and adds files, and a second that takes files from the queue and processes them in parallel), with a queue to communicate between them?
For the first part (folder checking), the suggested solution is to use FileSystemWatcher. Now the second part needs to be discussed (maybe some kind of task scheduler).
You may not need loops, and I'm not sure parallel is necessary either. It would be useful if you want to process a batch of new files.
A FileSystemWatcher on the folder where new files will appear will give you an event you can use to add a file to the queue.
Add an event for an item being added to the queue, to trigger a thread to process an individual file.
If you knock up a simple class (file, state, detected time, etc.), you'd have a detection thread adding to the queue and a thread pool processing the items, removing each one from the queue on success.
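As a rough sketch of that watcher, queue and workers arrangement (not code from the original answer): the `FolderPipeline` class, its `Enqueue` method and the `processFile` callback are names made up for illustration, and `BlockingCollection<T>` is just one .NET 4 type that happens to fit the "queue between two sides" role.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Hypothetical class: a FileSystemWatcher produces paths, a fixed set of
// worker tasks consumes them from a thread-safe blocking queue.
public sealed class FolderPipeline : IDisposable
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    private readonly FileSystemWatcher _watcher;
    private readonly Task[] _workers;

    public FolderPipeline(string folder, int workerCount, Action<string> processFile)
    {
        // Consumers: each worker blocks until a path is available.
        _workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            _workers[i] = Task.Factory.StartNew(() =>
            {
                foreach (string path in _queue.GetConsumingEnumerable())
                {
                    try { processFile(path); }
                    catch (Exception ex) { Console.Error.WriteLine(path + ": " + ex.Message); }
                }
            }, TaskCreationOptions.LongRunning);
        }

        // Producer: the watcher raises Created for each new XML file.
        _watcher = new FileSystemWatcher(folder, "*.xml");
        _watcher.Created += (sender, e) => Enqueue(e.FullPath);
        _watcher.EnableRaisingEvents = true;
    }

    // Also usable at startup for files that were already in the folder.
    public void Enqueue(string path)
    {
        try { _queue.Add(path); }
        catch (InvalidOperationException) { /* shutting down; ignore late events */ }
    }

    public void Dispose()
    {
        _watcher.Dispose();
        _queue.CompleteAdding();   // lets the workers drain the queue and exit
        Task.WaitAll(_workers);
        _queue.Dispose();
    }
}
```

One thing to watch: Created can fire before the writing application has finished the file, so `processFile` may need to retry if the file is still locked.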
You might find this previous question on thread-safe “lists” in .NET 4 useful:
Thread-safe List<T> property
Particularly if you want to process all new files since X.
Note: if you aren't going to use FileSystemWatcher and just get files from the folder, a Processed folder to move them to, and maybe a Failed folder as well, would be a good idea. Reading 200,000 filenames in to check whether you've processed them would sort of remove any benefit from processing them in parallel.
Even if you do use it, I'd recommend it. Just moving a file back into To Process (possibly after an edit, in the case of failures) will trigger it to be reprocessed. Another advantage: say you are processing into a database, it all goes nipples up, and your last backup was at X. You restore and then simply move all the files you did process back into the “toprocess” folder.
You can also do test runs with known input and check the db’s state before and after.
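A minimal sketch of the Processed/Failed folder idea, assuming the folder paths are supplied by the caller. `FileArchiver`, `ProcessAndMove` and the `process` callback are hypothetical names, not anything from an existing library.

```csharp
using System;
using System.IO;

// Hypothetical helper: run the processing step, then move the file out of
// the incoming folder so it is never picked up twice.
public static class FileArchiver
{
    // Returns true if 'process' succeeded and the file went to processedDir,
    // false if it threw and the file went to failedDir.
    public static bool ProcessAndMove(string path, string processedDir,
                                      string failedDir, Action<string> process)
    {
        bool ok;
        try
        {
            process(path);
            ok = true;
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(path + ": " + ex.Message);
            ok = false;
        }

        string targetDir = ok ? processedDir : failedDir;
        Directory.CreateDirectory(targetDir);            // no-op if it already exists
        string target = Path.Combine(targetDir, Path.GetFileName(path));

        // File.Move throws if the destination exists, e.g. on a reprocess run.
        if (File.Exists(target)) File.Delete(target);
        File.Move(path, target);
        return ok;
    }
}
```

Reprocessing is then just a matter of moving files from Processed (or Failed) back into the incoming folder.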
Further to comment.
The ThreadPool, which Task uses, has a thread limit, but that applies to all the foreground and background tasks in your app.
After comment.
If you want to limit the number of concurrent tasks…
Starter for ten you can easily improve upon, for tuning and boosting.
In your class that manages kicking off tasks from the file queue, something like
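The snippet itself is not reproduced here, so the following is only a guess at its general shape, not the original code: a counter of running tasks checked under a lock before each new task is started. `FileTaskManager`, `Paused`, `StartTasks` and the `processFile` callback are assumed names; `MaxTasks` follows the maxTasks setting mentioned further down.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical manager: starts tasks from the file queue, never running
// more than MaxTasks at the same time.
public sealed class FileTaskManager
{
    private readonly ConcurrentQueue<string> _files = new ConcurrentQueue<string>();
    private readonly object _gate = new object();
    private readonly Action<string> _processFile;
    private int _running;                    // guarded by _gate

    public int MaxTasks { get; set; }
    public bool Paused { get; set; }         // extra switch checked before starting work

    public FileTaskManager(int maxTasks, Action<string> processFile)
    {
        MaxTasks = maxTasks;
        _processFile = processFile;
    }

    public void Add(string path)
    {
        _files.Enqueue(path);
        StartTasks();
    }

    // Call this again after clearing Paused to resume work.
    public void StartTasks()
    {
        lock (_gate)
        {
            string path;
            while (!Paused && _running < MaxTasks && _files.TryDequeue(out path))
            {
                _running++;
                string current = path;       // per-iteration copy for the closure
                Task.Factory.StartNew(() =>
                {
                    try { _processFile(current); }
                    catch (Exception ex) { Console.Error.WriteLine(current + ": " + ex.Message); }
                    finally
                    {
                        lock (_gate) { _running--; }
                        StartTasks();        // a slot is free: pick up the next file
                    }
                });
            }
        }
    }
}
```

Each finishing task frees its slot and immediately tries to start the next queued file, so no separate polling loop is needed.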
Simple and safe (I think). You could have another property, such as pause or disable, to check as well. You might want to make the above a singleton ( 🙁 ), or at least bear in mind what happens if you run more than one…
The best advice I can give is to start simple, open and decoupled, and then complicate as necessary; it would be easy to start optimising prematurely here. It's a good idea not to have a load of threads all waiting on, say, the file system or a backend, but I doubt the number of processors is ever going to be the bottleneck, so your maxTasks is a bit thumb-in-the-air.
Some sort of self-tuning between a lower and an upper limit might be a good thing, as opposed to one fixed number.
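Purely as an illustration of what self-tuning could mean, here is one naive rule: nudge the limit up when the backlog is growing and down when the queue is empty, always clamped to the two bounds. The rule itself, `MaxTasksTuner` and all parameter names are assumptions made for the sketch; a real tuner would more likely look at throughput or wait times.

```csharp
using System;

// Hypothetical tuner: call periodically with the queue length observed on
// the previous and the current check, and use the result as the new limit.
public static class MaxTasksTuner
{
    public static int Next(int current, int lower, int upper,
                           int previousBacklog, int backlog)
    {
        int next = current;
        if (backlog > previousBacklog)
            next = current + 1;      // falling behind: allow one more task
        else if (backlog == 0)
            next = current - 1;      // queue is empty: back off by one

        // Never leave the configured [lower, upper] range.
        return Math.Max(lower, Math.Min(upper, next));
    }
}
```

For example, `MaxTasksTuner.Next(4, 2, 8, 10, 20)` gives 5, and once the limit is at 8 it stays there however large the backlog gets.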