I’ve got a job processor which needs to handle ~300 jobs in parallel (jobs can take up to 5 minutes to complete, but they are usually network-bound).
The issue I’ve got is that jobs tend to come in clumps of a specific type. For simplicity, let’s say there are six job types, JobA through JobF.
JobA through JobE are network-bound and can quite happily have 300 running together without taxing the system at all (in fact, I’ve managed to get more than 1,500 running side by side in tests). JobF (a new job type) is also network-bound, but it requires a considerable chunk of memory and uses GDI functionality.
I’m making sure I carefully dispose of all GDI objects with using blocks and, according to the profiler, I’m not leaking anything. It’s simply that running 300 JobF jobs in parallel uses more memory than .NET is willing to give me.
What’s the best-practice way of dealing with this? My first thought was to determine how much memory headroom I had and throttle spawning new jobs (at least JobF jobs) as I approach the limit. I haven’t been able to achieve this, as I can’t find any way to reliably determine how much memory the framework is willing to allocate me. I’d also have to guess at the maximum memory used by a job, which seems a little flaky.
My next plan was to simply throttle if I get OOMs and re-schedule the failed jobs. Unfortunately, the OOM can occur anywhere, not just inside the problematic jobs. In fact, the most common place is the main worker thread which manages the jobs. As things stand, this causes the process to do a graceful shutdown (if possible), restart and attempt to recover. While this works, it’s nasty and wasteful of time and resources – far worse than just recycling that particular job.
Is there a standard way to handle this situation (adding more memory is an option and will be done, but the application should handle this situation properly, not just bomb out)?
I am doing something remotely similar to your case, and I opted for an approach in which I have ONE task processor (the main queue manager, which runs on ONE node) and as many AGENTS as needed, running on one or more nodes.
Each of the agents runs as a separate process. They:
The queue manager is designed so that if any agent fails while executing a job, the job is simply re-tasked to another agent after some time.
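A minimal sketch of how that re-tasking could work, assuming a lease with a timeout (written in Python for brevity; it is not the actual queue manager, and the class name, method names and lease length are made up for illustration). A task handed to an agent stays "leased"; if the agent never reports completion before the lease expires, the task goes back on the queue for another agent.

```python
import time
from collections import deque


class LeaseQueue:
    """In-memory task queue; a task whose lease expires is handed out again."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds  # hypothetical timeout, tune per job type
        self.clock = clock                  # injectable so the logic can be tested
        self.pending = deque()
        self.leased = {}                    # task -> time at which its lease expires

    def add(self, task):
        self.pending.append(task)

    def take(self):
        """Called by an agent asking for work; returns None if nothing is pending."""
        self._requeue_expired()
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.leased[task] = self.clock() + self.lease_seconds
        return task

    def complete(self, task):
        """Called by an agent when it has finished the task."""
        self.leased.pop(task, None)

    def _requeue_expired(self):
        # Any task whose agent went silent (crashed, OOM, ...) is put back.
        now = self.clock()
        for task in [t for t, expiry in self.leased.items() if expiry <= now]:
            del self.leased[task]
            self.pending.append(task)
```

With this shape, an agent process that dies from an OOM takes only its own job with it, and the manager never needs to know why the agent disappeared.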
BTW, consider NOT having all the tasks run at once in parallel, since there really is some overhead (it might be substantial) in context switching. In your case, you might be saturating the network with unnecessary PROTOCOL traffic instead of real DATA traffic.
Another fine point of this design is that if I start to fall behind on data processing, I can always turn on one more machine (say, an Amazon EC2 instance) and run several more agents, which will help complete the task base more quickly.
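One way to limit parallelism inside a single process is a per-job-type cap. The sketch below is in Python for brevity and is only an illustration: the cap of 2 for JobF and the helper names are assumptions, not measured values. Job types without an entry in LIMITS run unthrottled.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical caps; job types not listed here are not throttled.
LIMITS = {"JobF": 2}
_semaphores = {name: threading.BoundedSemaphore(n) for name, n in LIMITS.items()}


def run_throttled(job_type, work):
    """Run work(), holding the per-type semaphore if that type has a cap."""
    sem = _semaphores.get(job_type)
    if sem is None:
        return work()
    with sem:
        return work()


def demo(n_jobs=8):
    """Submit n_jobs JobF tasks and return the peak number running at once."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def work():
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.01)  # stand-in for the real, memory-hungry job
        with lock:
            state["running"] -= 1

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(run_throttled, "JobF", work) for _ in range(n_jobs)]
        for future in futures:
            future.result()
    return state["peak"]
```

Note that a waiting job still occupies a pool thread in this sketch; a real scheduler would more likely keep throttled jobs in the queue until a slot frees up.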
In answer to your question:
Every host will take as much as it can, since there is a finite number of agents running on each host. When one task is DONE, another is taken, ad infinitum. I don’t use a database. Tasks aren’t time-critical, so I have one process that goes round and round over the incoming data set and creates new tasks if something failed in previous run(s). Concretely:
http://access3.streamsink.com/archive/ (source data)
http://access3.streamsink.com/tbstrips/ (calculated results)
On each queue manager run, the source and destination are scanned, the resulting sets are subtracted, and the remaining filenames are turned into tasks.
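The scan-and-subtract step can be sketched as a plain set difference. This is Python for brevity, and it assumes a result file carries the same name as its source file; the filenames in the usage lines are invented for illustration.

```python
def pending_tasks(source_names, result_names):
    """Return the source filenames that have no matching result yet.

    Assumes a result is stored under the same filename as its source,
    so "source minus results" is exactly the work still to be done.
    """
    return sorted(set(source_names) - set(result_names))


# Hypothetical directory listings from one manager run.
source = ["clip1.ts", "clip2.ts", "clip3.ts"]
results = ["clip2.ts"]
tasks = pending_tasks(source, results)  # clip1.ts and clip3.ts still need work
```

Because the task list is recomputed from the two listings on every run, a job that failed earlier reappears on its own, with no database or failure bookkeeping needed.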
And still some more:
I am using web services to get job info and return results, and simple HTTP to get the data for processing.
Finally:
This is the simpler of the two manager/agent pairs that I have; the other one is somewhat more complicated, so I won’t go into detail about it here. Use the e-mail 🙂