I am working on a problem where I need to perform a lot of embarrassingly parallelizable tasks. The task is created by reading data from the database but a collection of all tasks would exceed the amount of memory on the machine so tasks have to be created, processed and disposed. I am wondering what would be a good approach to solve this problem? I am thinking the following two approaches:
-
Implement a synchronized task queue. Implement a producer (task creater) that read data from database and put task in the queue (limit the number of tasks currently in the queue to a constant value to make sure that the amount of memory is not exceeded). Have multiple consumer processes (task processor) that read task from the queue, process task, store the result and dispose the task. What would be a good number of consumer processes in this approach?
-
Use .NET Parallel extension (PLINQ or parallel for), but I understand that a collection of tasks have to be created (Can we add tasks to the collection while processing in the parallel for?). So we will create a batch of tasks — say N tasks at a time and do process these batch of tasks and read another N tasks.
What are your thoughts on these two approaches?
Use a ThreadPool with a bounded queue to avoid overwhelming the system.
If each of your worker tasks is CPU bound then configure your system initially so that the number of threads in your system is equal to the number of hardware threads that your box can run.
If your tasks aren’t CPU bound then you’ll have to experiment with the pool size to get an optimal solution for your particular situation
You may have to experiment with either approach to get to the optimal configuration.
Basically, test, adjust, test, repeat until you’re happy.