I am writing MEX code in MATLAB to perform an operation (because the operation uses a C++ library). The MEX code has a section where a function is called repeatedly in a loop with a different argument value each time, and each call is independent (i.e., the computation in one call does not depend on previous calls). To speed this up, I wrote multithreaded code that creates one thread per loop iteration; in my example that is 10 threads. Each thread computes the function for a separate value of the argument, the threads return and are joined, some more computation is done, and a result is returned.
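For illustration, the pattern described above can be sketched roughly as follows. This is a simplified stand-in under stated assumptions, not the actual MEX code: `compute` and `run_all` are hypothetical names, and the loop body is a cheap placeholder for the real computation that takes minutes.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical placeholder for the expensive, independent per-iteration function.
double compute(int arg) {
    double acc = 0.0;
    for (int i = 1; i <= 1000; ++i) {
        acc += static_cast<double>(arg) / i;
    }
    return acc;
}

// Runs compute(0), ..., compute(n - 1) with one thread per iteration.
// Each thread writes only to its own slot of `results`, so no mutex is used.
std::vector<double> run_all(int n) {
    assert(n >= 0);
    std::vector<double> results(static_cast<std::size_t>(n), 0.0);
    std::vector<std::thread> workers;
    workers.reserve(results.size());
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&results, i] {
            results[static_cast<std::size_t>(i)] = compute(i);
        });
    }
    // Wait for every thread before the remaining computation uses the results.
    for (auto& t : workers) {
        t.join();
    }
    return results;
}
```

With 10 iterations, `run_all(10)` spawns 10 threads and returns once all of them have been joined.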
In theory, all this should give me a good speedup, but I see that the multithreaded code is a lot slower than the normal single-threaded one!! I have access to very powerful 24-core machines, so this is totally baffling, because I’d expected each thread to be scheduled on a separate core.
Any ideas as to what is causing this? Are there common problems or errors in code that lead to this?
Any help will be greatly appreciated.
EDIT:
To address the questions raised in the answers proposed here, I want to share some information about my code:
1. Each function call takes a few minutes, so synchronization and thread spawning should not be a significant overhead here (though if there are circumstances in which they still matter, any info about that would be really helpful!)
2. Each thread does access common data structures, arrays, and matrices, but the values in these are never overwritten. All writes go to variables, pointers, arrays, etc. that are local to the thread. So I am guessing there shouldn’t be many cache misses here?
3. Also, there are no mutex-protected sections in my code, since no thread writes to any common memory location. All writes are to memory locations local to the thread.
I’m still trying to figure out the reason why my multithreaded implementation is not working 🙁 So, any pointers/info will be really helpful!
Thanks!!
Given how general your question is, the general answer is that there are probably two effects in play:
I would test the job with varying numbers of threads. It may turn out, for instance, that using two threads is advantageous, but four or more is not. For more detailed answers, add more details to the question, such as type of computation, size of dataset, etc.
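One way to run that experiment is a small timing harness like the sketch below. It assumes standard C++11 `std::thread` and `std::chrono`; `work_item` and `time_with_threads` are hypothetical names, and `work_item` is only a cheap placeholder for the real per-iteration function.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical placeholder for the real, expensive per-iteration function.
double work_item(int arg) {
    double acc = 0.0;
    for (int i = 1; i <= 100000; ++i) {
        acc += static_cast<double>(arg) / i;
    }
    return acc;
}

// Spreads `num_tasks` independent calls over `num_threads` threads and
// returns the elapsed wall-clock time in seconds. Results go into `out`.
double time_with_threads(int num_tasks, int num_threads, std::vector<double>& out) {
    assert(num_tasks >= 0 && num_threads > 0);
    out.assign(static_cast<std::size_t>(num_tasks), 0.0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    pool.reserve(static_cast<std::size_t>(num_threads));
    for (int t = 0; t < num_threads; ++t) {
        pool.emplace_back([&out, t, num_tasks, num_threads] {
            // Static striding: thread t handles tasks t, t + num_threads, ...
            for (int i = t; i < num_tasks; i += num_threads) {
                out[static_cast<std::size_t>(i)] = work_item(i);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling `time_with_threads(10, k, out)` for k = 1, 2, 4, 8 and 10 and comparing the returned times shows where adding threads stops helping. If the time grows with k even though the tasks are independent, the threads are probably competing for a shared resource such as memory bandwidth or a lock inside a library.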