I expect there are many possible solutions to this question. I can come up with a few myself, some clearly better than others, but none that I am certain are optimal, so I’m interested in hearing from you real multithreading gurus out there.
I have circa 100 pieces of work that can be executed concurrently, as there are no dependencies between them. If I execute these sequentially, my total execution time is approx. 1m 30s. If I queue each piece of work in the thread pool, it takes approx. 2m, which suggests to me that I am trying to do too much at once and that context switching between all these threads is negating the advantage of having them.
My assumption (please feel free to shoot me down if this is wrong) is that if I only queue as many pieces of work at any one time as there are cores in my system (8 on this machine), I will reduce context switching and thus improve overall efficiency (other processes’ threads notwithstanding, of course). Can anyone suggest the optimal pattern/technique for doing this?
BTW I am using smartthreadpool.codeplex.com, but I don’t have to.
A good threadpool already tries to have one active thread per available core. That isn’t the same as having exactly one worker thread per core, though: if a thread is blocked (most classically on I/O), you want another thread to be using that core.
It might be worth trying the .NET threadpool instead, or the Parallel class.
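As a rough illustration of the Parallel class suggestion, here is a minimal sketch that caps concurrency at the number of logical cores via ParallelOptions.MaxDegreeOfParallelism. DoWork and the integer work items are hypothetical stand-ins for the real pieces of work, and whether an explicit cap actually helps is something to measure rather than assume.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BoundedParallelSketch
{
    // Hypothetical stand-in for one of the ~100 independent pieces of work.
    public static int DoWork(int item)
    {
        int acc = 0;
        for (int i = 0; i < 1000; i++)
            acc = unchecked(acc * 31 + item + i);
        return acc;
    }

    public static int[] RunAll(IReadOnlyList<int> items)
    {
        var results = new int[items.Count];
        var options = new ParallelOptions
        {
            // Cap the number of concurrently running iterations at the logical core count.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // Each iteration writes only to its own slot, so no locking is needed here.
        Parallel.For(0, items.Count, options, i =>
        {
            results[i] = DoWork(items[i]);
        });
        return results;
    }

    public static void Main()
    {
        int[] results = RunAll(Enumerable.Range(0, 100).ToList());
        Console.WriteLine($"Completed {results.Length} items");
    }
}
```

If the work is CPU-bound, leaving MaxDegreeOfParallelism unset and letting the scheduler decide is also worth timing, since the default already adapts to the available cores.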
If your CPU is hyper-threaded (8 virtual cores on 4 physical) this could be an issue. On average hyper-threading makes things faster, but there are plenty of cases where it makes them worse. Try setting affinity to every other core and see if it gives you an improvement; if it does, then this is likely a case where hyper-threading is bad.
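One way to try the affinity experiment from inside the process is Process.ProcessorAffinity, which takes a bitmask of allowed logical processors. The sketch below assumes hyper-threaded siblings are numbered adjacently (0/1, 2/3, and so on), so that every other bit selects one logical processor per physical core; that numbering varies by OS and hardware, so check your machine's topology first. EveryOtherCoreMask is a hypothetical helper, not a framework method.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

public static class AffinitySketch
{
    // Builds a mask with every other bit set, e.g. 0x55 (binary 01010101) for 8 logical processors.
    public static long EveryOtherCoreMask(int logicalProcessors)
    {
        long mask = 0;
        for (int i = 0; i < logicalProcessors && i < 64; i += 2)
            mask |= 1L << i;
        return mask;
    }

    public static void Main()
    {
        long mask = EveryOtherCoreMask(Environment.ProcessorCount);
        try
        {
            // Restrict this process to the selected logical processors.
            Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)mask;
            Console.WriteLine($"Affinity mask set to 0x{mask:X}");
        }
        catch (Exception ex) when (ex is PlatformNotSupportedException || ex is Win32Exception)
        {
            // Setting affinity is not available everywhere and can be rejected by the OS.
            Console.WriteLine($"Could not set affinity: {ex.Message}");
        }

        // Run the workload here and compare the timing with an unrestricted run.
    }
}
```

On Windows the same experiment can be done without code by setting the process affinity in Task Manager.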
Do you have to gather results together again, or share any resources between the different tasks? The cost of doing this could well be greater than the savings of multi-threading. Perhaps you are paying that cost unnecessarily, though. For example, if you are locking on shared data but that data is only ever read, you don’t actually need to lock with most data structures (most, but not all, are safe for concurrent reads if there are no writes).
The partitioning of the work could be an issue too. Say the single-threaded approach works its way through an area of memory, but the multi-threaded approach gives each thread its next bit of memory to work with round-robin. Here there’d be more cache-flushing per core as the “good next bit” is actually being used by another core. In this situation, splitting work into bigger chunks can fix it.
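To make the "bigger chunks" idea concrete, here is a minimal sketch using Partitioner.Create with an explicit range size, so each thread walks a contiguous block of the array rather than interleaving single elements with other threads. The squaring workload and the chunk size of 1024 are placeholders chosen for illustration; the right chunk size depends on the real work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class ChunkedPartitionSketch
{
    // Hypothetical workload: square every element, processing contiguous chunks per thread.
    public static long[] SquareAll(int[] input, int chunkSize)
    {
        var output = new long[input.Length];
        if (input.Length == 0)
            return output; // Partitioner.Create rejects an empty range.

        // Yields contiguous [start, end) index ranges of at most chunkSize elements.
        var ranges = Partitioner.Create(0, input.Length, chunkSize);

        Parallel.ForEach(ranges, range =>
        {
            // Each thread scans its own contiguous block, which is friendlier to the cache.
            for (int i = range.Item1; i < range.Item2; i++)
                output[i] = (long)input[i] * input[i];
        });
        return output;
    }

    public static void Main()
    {
        int[] input = Enumerable.Range(0, 10000).ToArray();
        long[] squares = SquareAll(input, 1024);
        Console.WriteLine($"Processed {squares.Length} elements");
    }
}
```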
There are plenty of other factors that can make a multi-threaded approach perform worse than a single-threaded, but those are a few I can think of immediately.
Edit: If you are writing to a shared store, it could be worth trying a run where you just throw away any results. That could narrow down whether that’s where the issue lies.
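A minimal sketch of that diagnostic, assuming the shared store can be abstracted as a callback: run the same parallel work once with a sink that writes to a locked list and once with a sink that discards everything, then compare the timings. Compute, TimeRun and the locked List<int> are all hypothetical stand-ins for the real work and store.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class DiscardResultsSketch
{
    // Hypothetical stand-in for the real computation.
    public static int Compute(int item) => unchecked(item * 31 + 7);

    // Runs all items in parallel, handing each result to the supplied sink, and returns the elapsed time.
    public static TimeSpan TimeRun(int count, Action<int> sink)
    {
        var sw = Stopwatch.StartNew();
        Parallel.For(0, count, i => sink(Compute(i)));
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        var store = new List<int>();
        var gate = new object();

        // Run 1: results go to a shared store guarded by a lock.
        TimeSpan withStore = TimeRun(100, r => { lock (gate) store.Add(r); });

        // Run 2: results are thrown away, so no shared state is touched.
        TimeSpan discarded = TimeRun(100, _ => { });

        Console.WriteLine($"With shared store: {withStore.TotalMilliseconds} ms");
        Console.WriteLine($"Results discarded: {discarded.TotalMilliseconds} ms");
    }
}
```

If the second run is dramatically faster with the real workload, contention on the store is a likely culprit; if the two are close, the problem probably lies elsewhere.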