We have a batch process consisting of about 5 calculations that run on each row of data (20 million rows total). Our production server will have around 24 processors with decent CPUs.
Performance is critical for us. Assuming that our algorithms are pretty efficient, what would be the best way to minimize total run time for this? Specifically, should we be able to achieve better performance through multi-threading, using thread pools, etc.? Also, could using the Process object to divide the batch among multiple programs be of benefit?
A few thoughts:
First, you need to put a bit more definition around “best”: there are trade-offs involved in processing on this scale. Specifically, memory, I/O, and CPU utilization are all considerations, for example how much memory each calculation requires.
Assuming that you are the only process on the machine, you have lots of memory, and you are primarily interested in optimizing throughput, here are some suggestions:
In addition to thread pools, there is the Task Parallel Library, which offers facilities that simplify the development of such parallel computations. It is specifically designed to scale to the number of cores and optimize the way threads are used. There’s also Parallel LINQ, which you may find useful.
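To make that concrete, here is a minimal sketch of both approaches. It assumes the rows are already in memory as an array and that each row can be processed independently; `ProcessRow` is a hypothetical stand-in for your five calculations, and the row count is scaled down for the demo. Treat it as a starting point to measure against, not a tuned implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class BatchDemo
{
    // Hypothetical placeholder for the ~5 per-row calculations.
    static double ProcessRow(double value)
    {
        return Math.Sqrt(value) + Math.Sin(value) * 2.0;
    }

    static void Main()
    {
        const int rowCount = 1000000; // scaled down from 20 million for the demo
        double[] input = Enumerable.Range(0, rowCount).Select(i => (double)i).ToArray();
        double[] output = new double[rowCount];

        // Task Parallel Library: Partitioner.Create splits the index range into
        // contiguous chunks, so each worker loops over many rows per delegate call.
        Parallel.ForEach(Partitioner.Create(0, rowCount), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                output[i] = ProcessRow(input[i]);
            }
        });

        // Parallel LINQ: the same work written as a query.
        // AsOrdered keeps results in input order, at some cost.
        double[] viaPlinq = input.AsParallel()
                                 .AsOrdered()
                                 .Select(x => ProcessRow(x))
                                 .ToArray();

        // Sanity check that both approaches produced the same results.
        Console.WriteLine(output.SequenceEqual(viaPlinq));
    }
}
```

Range partitioning is used here because the per-row work is small; with cheap calculations, invoking a delegate per row can cost more than the calculation itself. If your rows come from a database or file rather than memory, I/O may dominate and the picture changes, so profile before committing to either form.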