I have been working on implementing a reasoner, a fairly complex piece of software. I tried to improve its performance by running the work on parallel threads, but all I gained was overhead.
My question is whether there are other potential bottlenecks besides monitors (locks). I have already removed every use of synchronized and volatile from my program.
I use the java.util.concurrent utilities and split the data into separate, standalone arrays, one per thread.
The most useful thing you can do is ensure your threads are performing long sequences of independent work. These sequences need to be significantly longer than the overhead you are likely to incur (say 1 to 10 microseconds).
A common mistake is to break up the work too finely (creating a lot of overhead in the process). You only need one task per core to keep every core busy.
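As a rough illustration of "one task per core", here is a minimal sketch that splits an array into one contiguous chunk per task and gives each task its own local accumulator, so nothing is shared until the partial results are combined. The class name ChunkedSum, the method parallelSum and the process() placeholder are made up for this example; your reasoner's real work would go where process() is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedSum {

    // Hypothetical stand-in for the real per-element work.
    static long process(long value) {
        return value * value;
    }

    // Splits data into `tasks` contiguous chunks (tasks must be >= 1)
    // and submits exactly one task per chunk.
    static long parallelSum(long[] data, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            int chunk = (data.length + tasks - 1) / tasks; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                final int from = Math.min(data.length, t * chunk);
                final int to = Math.min(data.length, from + chunk);
                Callable<Long> job = () -> {
                    long sum = 0; // local to this task, no shared state
                    for (int i = from; i < to; i++) {
                        sum += process(data[i]);
                    }
                    return sum;
                };
                partials.add(pool.submit(job));
            }
            // Combine the partial results once, after the long-running work.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 100;
        }
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(data, cores));
    }
}
```

If each call to process() only takes a few nanoseconds, submitting one task per element would be dominated by scheduling overhead; batching a whole chunk into each task is what keeps the per-task work well above that cost.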
Without more details of what you are trying to do and how you are breaking up your work, it's hard to suggest anything more specific.