For a class project I am writing a simple matrix multiplier in Python. My professor has asked for it to be threaded. The way I handle this right now is to create a thread for every row and throw the result in another matrix.
What I wanted to know if it would be faster that instead of creating a thread for each row it creates some amount threads that each handles various rows.
For example: given Matrix1 100×100 * Matrix2 100×100 (matrix sizes can vary widely):
- 4 threads each handling 25 rows
- 10 threads each handling 10 rows
Maybe this is a problem of fine tuning or maybe the thread creation process overhead is still faster than the above distribution mechanism.
You will probably get the best performance if you use one thread for each CPU core available to the machine running your application. You won’t get any performance benefit by running more threads than you have processors.
If you are planning to spawn new threads each time you perform a matrix multiplication then there is very little hope of your multi-threaded app ever outperforming the single-threaded version unless you are multiplying really huge matrices. The overhead involved in thread creation is just too high relative to the time required to multiply matrices. However, you could get a significant performance boost if you spawn all the worker threads once when your process starts and then reuse them over and over again to perform many matrix multiplications.
For each pair of matrices you want to multiply you will want to load the multiplicand and multiplier matrices into memory once and then allow all of your worker threads to access the memory simultaneously. This should be safe because those matrices will not be changing during the multiplication.
You should also be able to allow all the worker threads to write their output simultaneously into the same output matrix because (due to the nature of matrix multiplication) each thread will end up writing its output to different elements of the matrix and there will not be any contention.
I think you should distribute the rows between threads by maintaining an integer
NextRowToProcessthat is shared by all of the threads. Whenever a thread is ready to process another row it callsInterlockedIncrement(or whatever atomic increment operation you have available on your platform) to safely get the next row to process.