Given (particularly) the following scenario:
- One thread per core,
- Each core having its own distinct cache,
- Programs where cache hit/miss ratios are central to good performance (i.e. most programs today)
I’ve read frequently about the benefits of thread pooling for scheduling work in multicore systems. Although there are a number of approaches to multithreading, the comparison is often made between a smarter, load-balancing approach like this, and a more naive, “assign threads by task type” approach where load-balancing is assumed to have been handled at development time, rather than by the system itself at runtime. An example of this might be dedicated number crunching on one thread and rendering tasks on another.
It seems to me that under the above conditions, the thread-by-task-type approach could lead to far better performance, since each core's local cache would be that much more effective for the specific task assigned to it. Is that so? (Assuming that waiting is not much of an issue, i.e. both threads are running at or close to full steam.)
I also wonder what performance impact thread-safety mechanisms might have, in load-balanced vs naive approaches.
The biggest performance gain in threaded programming comes not from the type of task that is forked off but from how asynchronously it can operate. If the task is closely tied to shared resources, meaning that it has to lock and synchronize memory often, then you are not going to get the benefit of the cache memory on the separate processors.
Edit: I now see where you are going, Nick. A load-balancing system is, by definition, consuming from some common task queue, so it may have more synchronization points. However, disparate tasks may be dealing with the database or writing to the file system, blocking on other resources in ways that hamper their ability to run asynchronously. And even though a load-balancing approach might need more synchronization, it can be the easiest way to significantly improve the throughput of your application.
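To make the two scheduling styles concrete, here is a minimal Java sketch. The `crunch` and `render` methods are hypothetical stand-ins for the number-crunching and rendering work in the question, and the pool size of 2 is an arbitrary assumption. Note that neither variant pins a thread to a core: standard Java has no affinity API, so the OS still decides where each thread runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolVsDedicated {

    // Hypothetical stand-in for "number crunching": sum of squares 1..n.
    public static long crunch(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Hypothetical stand-in for "rendering": length of a frame label.
    public static long render(int frame) {
        return ("frame-" + frame).length();
    }

    // Waits for every future and adds up the results.
    static long sumOf(List<Future<Long>> futures) {
        long total = 0;
        try {
            for (Future<Long> f : futures) {
                total += f.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    // Load-balanced: both task types go through one pool, and the worker
    // threads all take work from the pool's single shared queue.
    public static long loadBalanced(int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                final int n = i;
                Callable<Long> crunchJob = () -> crunch(n);
                Callable<Long> renderJob = () -> render(n);
                futures.add(pool.submit(crunchJob));
                futures.add(pool.submit(renderJob));
            }
            return sumOf(futures);
        } finally {
            pool.shutdown();
        }
    }

    // Thread-by-task-type: one dedicated thread (with its own queue) per kind of work.
    public static long byTaskType(int jobs) {
        ExecutorService cruncher = Executors.newSingleThreadExecutor();
        ExecutorService renderer = Executors.newSingleThreadExecutor();
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                final int n = i;
                Callable<Long> crunchJob = () -> crunch(n);
                Callable<Long> renderJob = () -> render(n);
                futures.add(cruncher.submit(crunchJob));
                futures.add(renderer.submit(renderJob));
            }
            return sumOf(futures);
        } finally {
            cruncher.shutdown();
            renderer.shutdown();
        }
    }

    public static void main(String[] args) {
        // Both strategies do the same work, so the totals match.
        System.out.println(loadBalanced(100) == byTaskType(100));
    }
}
```

In the first variant the workers share one queue as a synchronization point; in the second, if one kind of work dries up or blocks, its dedicated thread sits idle while the other may be saturated.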
As for the performance impact of thread-safety mechanisms: right, this is the critical point. Thread-safety implies synchronization or other locking, as well as writing dirty memory to central storage and invalidating cache memory that has been updated by other threads. The more a threaded task has to perform such operations, the more its performance will suffer.
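As a rough illustration of that cost, the sketch below (class and method names are made up for this example) computes the same total in two ways. In the first, every step updates a shared `AtomicLong`, so the cores keep contending for the same memory. In the second, each thread works on a local variable and synchronizes only once at the end. It does not measure anything; timing it reliably would need a proper benchmark harness.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SharedVsLocal {

    // Starts `threads` workers running `body` and waits for all of them to finish.
    static void runAll(int threads, Runnable body) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(body);
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    // Tightly coupled: every single increment is a synchronized update of shared state.
    public static long sharedEveryStep(int threads, int perThread) {
        AtomicLong total = new AtomicLong();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                total.incrementAndGet();
            }
        });
        return total.get();
    }

    // Loosely coupled: each thread counts privately and publishes its result once.
    public static long localThenMerge(int threads, int perThread) {
        AtomicLong total = new AtomicLong();
        runAll(threads, () -> {
            long local = 0;
            for (int i = 0; i < perThread; i++) {
                local++;
            }
            total.addAndGet(local);
        });
        return total.get();
    }

    public static void main(String[] args) {
        // Both produce the same total; only the amount of synchronization differs.
        System.out.println(sharedEveryStep(4, 1_000_000));
        System.out.println(localThenMerge(4, 1_000_000));
    }
}
```

Either version could run on a pool or on dedicated threads; what differs is how often the code touches shared state.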
Again, this is not necessarily about how work is assigned (load-balancing versus task-type) but about the specifics of the code necessary to perform the tasks.
Lastly, it is important to realize that most threaded programs have more threads than CPUs/cores, especially considering that there are often background GC, finalizer, and other VM threads (assuming you are on a VM) competing for CPU resources. This makes it even more difficult to optimize a threaded program to make use of thread-to-CPU affinity, which seems to be implied in your question.
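If you are on the JVM, a quick way to see this is to compare the core count with the threads that already exist before you start any workers. This is a small sketch with made-up class and method names. It only counts Java-level threads, so native VM threads (GC workers, for example) may not appear in the numbers, and the exact list varies by JVM and version.

```java
import java.lang.management.ManagementFactory;

public class ThreadsVsCores {

    // Number of processors the JVM reports as available to it.
    public static int cores() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Live Java threads (daemon and non-daemon) known to the thread MXBean.
    public static int liveJavaThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("cores available to the JVM: " + cores());
        System.out.println("live Java threads before starting any workers: " + liveJavaThreads());
        // List the threads by name to see what is already competing for the cores.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            System.out.println("  " + t.getName() + (t.isDaemon() ? " (daemon)" : ""));
        }
    }
}
```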