Problem
I have a piece of Java code (JDK 1.6.0_22, if relevant) that implements a stateless, side-effect-free function with no mutexes. It does, however, use a lot of memory (I don’t know if that is relevant).
In the past I have visited Sun Laboratories and gathered the standard “performance vs number of threads” curve. As this function has no mutexes, it produced a nice graph, although garbage collection kicked in as the number of threads increased. After some garbage collection tuning I was able to make this curve almost flat.
I am now doing the same experiment on Intel hardware. The hardware has 4 CPUs, each with 8 cores, and hyperthreading. This gives 64 availableProcessors(). Unfortunately, the “performance vs number of threads” curve scales nicely for 1, 2 and 3 threads, then caps at 3 threads. Beyond 3 threads I can throw as many threads as I want at the task, and the performance gets no better.
Attempts to fix the Problem
My first thought was that I had been stupid and introduced some synchronised code somewhere. Normally, to resolve this kind of issue, I run JConsole or JVisualVM and look at the thread stack traces. If I have 64 threads running at the speed of 3, I would expect 61 of them to be sitting waiting to enter a mutex. I didn’t find this. Instead I found all the threads running, just very slowly.
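The same check can be done from inside the JVM. The following is a minimal sketch, not what JConsole does internally: it uses the standard `java.lang.management` API to count live threads by state. The class name `ThreadStateSummary` is made up for illustration. If 61 threads were queued on a monitor they would show up as BLOCKED; threads that are busy (including ones spinning without taking a lock) show up as RUNNABLE, which matches what I saw.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper class, named for illustration only.
class ThreadStateSummary {

    /** Counts the live threads of this JVM, grouped by Thread.State. */
    static Map<Thread.State, Integer> countStates() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts =
                new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread died between the two calls
            }
            Integer old = counts.get(info.getThreadState());
            counts.put(info.getThreadState(), old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints a map of thread states to counts for the current JVM.
        System.out.println(countStates());
    }
}
```

In a real run this would be called periodically from a monitoring thread while the workload is executing, so that the snapshot reflects the worker threads rather than an idle JVM.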
A second thought was that perhaps the timing framework was introducing problems. I replaced my function with a dummy function that just counts to a billion using an AtomicLong. This scaled beautifully with the number of threads: counting to a billion 10,000 times was 64 times quicker with 64 threads than with 1 thread.
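For anyone wanting to reproduce this kind of measurement, here is a rough sketch of a scaling harness. It is not my actual timing framework: the names `ScalingHarness`, `timeRun` and `countTo` are invented for this example, the loop limits are scaled down so it finishes quickly, and it assumes each task counts on its own private AtomicLong (a counter shared between threads would itself become a point of contention).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical harness, named for illustration only.
class ScalingHarness {

    /** Runs task `reps` times on each of `threads` threads; returns elapsed nanoseconds. */
    static long timeRun(int threads, final int reps, final Runnable task)
            throws InterruptedException {
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await(); // release all workers at the same moment
                        for (int r = 0; r < reps; r++) {
                            task.run();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                }
            }).start();
        }
        long t0 = System.nanoTime();
        start.countDown();
        done.await();
        return System.nanoTime() - t0;
    }

    /** Dummy workload: counts up to `limit` on a counter private to each call. */
    static Runnable countTo(final long limit) {
        return new Runnable() {
            public void run() {
                AtomicLong counter = new AtomicLong();
                while (counter.incrementAndGet() < limit) {
                    // busy counting, no shared state
                }
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        int max = Runtime.getRuntime().availableProcessors();
        for (int n = 1; n <= max; n *= 2) {
            long ns = timeRun(n, 10, countTo(10000000L));
            // Total tasks completed per second; should grow with n if the task scales.
            System.out.printf("%d threads: %.1f tasks/s%n", n, n * 10 / (ns / 1e9));
        }
    }
}
```

If the dummy task scales and the real function does not when plugged into the same harness, the harness itself is unlikely to be the bottleneck.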
I thought (desperation kicking in) perhaps garbage collection is taking a really really long time, so I tweaked the garbage collection parameters. While this improved my latency variation, it had no effect on throughput: I still have 64 threads running at the speed I expect 3 to run at.
I have downloaded the Intel tool VTune, but my skill with it is weak: it is a complex tool and I don’t understand it yet. I have the instruction book on order (a fun Christmas present to myself), but that is a little too late to help with my current problem.
Question
- What tools (mental or software) could I use to improve my understanding of what is going on?
- What mechanisms other than mutexes or garbage collection could be slowing my code down?
Well, many experiments later, I discovered that the JVM makes no difference, but I also discovered the power of JDump. 50 of the 64 threads were at the following line.
Random.next looks like this
Most interesting is that this isn’t an obvious lock, so the tools I was using to spot mutexes weren’t working.
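To illustrate why this does not look like a lock: as far as I know, `java.util.Random.next` keeps its seed in an AtomicLong and updates it with a compare-and-set retry loop. The sketch below is a standalone reconstruction of that pattern, not the JDK source itself. The class name `CasSeedSketch` and the `retries` counter are my additions for illustration; the constants are the ones documented for `java.util.Random`'s linear congruential generator. A thread that loses the race never blocks, it just loops again, so it stays RUNNABLE while making little progress.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical standalone sketch of a CAS-updated seed; not the JDK class.
class CasSeedSketch {
    // Constants of the 48-bit linear congruential generator used by java.util.Random.
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    private final AtomicLong seed;
    // Added for illustration only: counts how often a thread lost the race.
    final AtomicLong retries = new AtomicLong();

    CasSeedSketch(long initialSeed) {
        // Same initial scrambling that java.util.Random applies to its seed.
        seed = new AtomicLong((initialSeed ^ MULTIPLIER) & MASK);
    }

    int next(int bits) {
        long oldSeed;
        long nextSeed;
        while (true) {
            oldSeed = seed.get();
            nextSeed = (oldSeed * MULTIPLIER + ADDEND) & MASK;
            if (seed.compareAndSet(oldSeed, nextSeed)) {
                break; // this thread won and published the new seed
            }
            retries.incrementAndGet(); // another thread got there first: spin again
        }
        return (int) (nextSeed >>> (48 - bits));
    }

    public static void main(String[] args) throws InterruptedException {
        final CasSeedSketch shared = new CasSeedSketch(42L);
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int k = 0; k < 1000000; k++) {
                        shared.next(32);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("CAS retries under contention: " + shared.retries.get());
    }
}
```

On a machine with many cores the retry count tends to grow with the number of threads sharing the one instance, which is the kind of slowdown that monitor-based tools do not flag.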
So it looks as though any creation of Java HashMaps causes applications to stop being scalable (I exaggerate, but not much). My application does make heavy use of HashMaps, so I guess I either rewrite HashMap or rewrite the application.
I’m raising a separate question to see how to deal with this.
Thanks for all the help.