I installed Eclipse 3.7 (Java EE edition, Win64) and then followed the installation instructions for Ateji from version 1.2 of the manual.
The results I get from running the speedup example for I = J = 100000:
PERFORMANCE COMPARISON BETWEEN SEQUENTIAL AND PARALLEL COMPREHENSIONS
sequential sum:
`+ for (int i : I, int j : J) (i*j);
parallel sum:
`+ for || (int i : I, int j : J) (i*j);
data size : I = 100000; J = 100000
Wait for the result...
sequential sum: mean time = 202 ms; standard deviation = 1 ms; ( 8473 8460 203 202 202 204 203 202 205 202 203 202 203 204 203 202 204 202 203 203 )
parallel sum: mean time = 2017 ms; standard deviation = 961.311 ms; ( 1787 1800 1790 1847 1457 1442 1698 1457 1455 1439 1467 4083 3239 1461 1458 1469 1470 1469 3077 4311 )
Speed up = 0.10014873574615767
Available processors = 8
My processor activity monitor shows that the 4 cores are indeed used during the parallel task.
The hello world example works (“hello” and “world” get printed, in random order).
I checked the troubleshooting section of the Ateji manual and everything is correct (I used a JDK and a JRE 1.7).
Where could the problem come from? Thanks!
What this surprising result teaches is that you shouldn’t rely on microbenchmarks.
On my 4 core laptop, I get the expected speedup with a Java6 VM (1.6.0_22-b04 HotSpot(TM) 64-Bit Server):
On the same machine, I get the surprising result you mention with a Java7 VM (1.7.0_03-b05 HotSpot(TM) 64-Bit Server):
Note how the sequential time has been divided by a factor of 50 between the two VM versions! This is definitely a sign that a powerful optimization has kicked in.
A clever VM could go as far as not doing any computation at all (time = 0 ms), since the result of the sum can be expressed statically as a simple algebraic expression. There must be something in the parallel version of the code that precludes the same optimization, hence the surprising results you see.
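To make the "simple algebraic expression" concrete, here is a minimal plain-Java sketch. It assumes that i and j range over 0..n-1 and 0..m-1 (the actual ranges in the speedup example may differ), and the class and method names are hypothetical. It only shows that the double sum factors into a product of two arithmetic series; it does not claim that this is the exact transformation HotSpot performs.

```java
// Hypothetical illustration; names and index ranges are assumptions,
// not taken from the Ateji speedup example.
final class ClosedFormSum {

    // Brute force: n * m multiplications and additions.
    static long nestedLoopSum(int n, int m) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                sum += (long) i * j;
            }
        }
        return sum;
    }

    // Same value in constant time, because
    // sum over i,j of (i * j) = (sum over i of i) * (sum over j of j).
    static long closedFormSum(int n, int m) {
        long sumI = (long) n * (n - 1) / 2; // 0 + 1 + ... + (n - 1)
        long sumJ = (long) m * (m - 1) / 2; // 0 + 1 + ... + (m - 1)
        return sumI * sumJ;
    }

    public static void main(String[] args) {
        int n = 2000;
        int m = 3000;
        System.out.println(nestedLoopSum(n, m) == closedFormSum(n, m));
    }
}
```

With summands this regular, the amount of work a benchmark measures depends entirely on how much of the loop the JIT compiler manages to simplify.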
Indeed, if you change the summation expression to the more realistic
where summands are taken from input arrays, so the sum cannot be optimized away, then you get speed-up results more in line with your expectations:
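As a rough sketch of this idea in plain Java (not Ateji PX syntax), assume the summand has the form x[i] * y[j] with x and y filled at run time. The class name, the method names and the use of Java 8 parallel streams for the parallel variant are all assumptions made for illustration; the original benchmark code is not reproduced here.

```java
import java.util.Random;
import java.util.stream.IntStream;

// Hypothetical sketch: summands come from input arrays, so the result
// depends on run-time data and has no static closed form.
final class ArraySumSketch {

    static long sequentialSum(int[] x, int[] y) {
        long sum = 0;
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < y.length; j++) {
                sum += (long) x[i] * y[j];
            }
        }
        return sum;
    }

    // Parallel over the outer index; each task sums one "row".
    static long parallelSum(int[] x, int[] y) {
        return IntStream.range(0, x.length)
                .parallel()
                .mapToLong(i -> {
                    long rowSum = 0;
                    for (int j = 0; j < y.length; j++) {
                        rowSum += (long) x[i] * y[j];
                    }
                    return rowSum;
                })
                .sum();
    }

    // Pseudo-random input in [0, 1000), reproducible through the seed.
    static int[] randomInput(int size, long seed) {
        return new Random(seed).ints(size, 0, 1000).toArray();
    }

    public static void main(String[] args) {
        int[] x = randomInput(5000, 1L);
        int[] y = randomInput(5000, 2L);
        System.out.println(sequentialSum(x, y) == parallelSum(x, y));
    }
}
```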
JRE6
JRE7
The lower speedup figures are due to concurrent access to the arrays x and y. Using a local copy of the array for each core would probably provide a speedup close to 4 as in the original example.
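One way to try the local-copy idea in plain Java is sketched below. It assumes the summand is x[i] * y[j] and uses a fixed thread pool instead of Ateji PX constructs; the class name, method name and chunking scheme are hypothetical. Each worker copies its slice of x and the whole of y before summing, so the inner loops only touch worker-private arrays. Whether this actually brings the speedup back close to 4 would have to be measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: every worker sums over its own private copies.
final class LocalCopySum {

    static long sum(int[] x, int[] y, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (x.length + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(w * chunk, x.length);
                final int to = Math.min(from + chunk, x.length);
                parts.add(pool.submit(() -> {
                    int[] localX = Arrays.copyOfRange(x, from, to); // this worker's slice
                    int[] localY = y.clone();                       // private copy of y
                    long partial = 0;
                    for (int xi : localX) {
                        for (int yj : localY) {
                            partial += (long) xi * yj;
                        }
                    }
                    return partial;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // wait for each partial sum
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel sum failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3};
        int[] y = {4, 5};
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(sum(x, y, workers));
    }
}
```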
Patrick