I wrote a couple of Java classes (SingleThreadedCompute and MultithreadedCompute) to demonstrate the fact (or what I always thought was a fact!) that if you parallelize a compute-centric (no I/O) task on a single-core machine, you don't get a speedup. Indeed, my understanding is that parallelizing such a task actually slows it down, because now you have to pay for context-switching overhead. Well, I ran the classes and the parallel version unexpectedly runs faster: on my machine the single-threaded version consistently takes just over 7 seconds, and the multithreaded version consistently takes just over 6 seconds. Can anybody explain how this is possible?
Here are the classes if anybody wants to look or try it for themselves.
    public final class SingleThreadedCompute {
        private static final long _1B = 1000000000L; // one billion

        public static void main(String[] args) {
            long startMs = System.currentTimeMillis();

            // Sum 0 + 1 + ... + (_1B - 1) on the main thread.
            long total = 0;
            for (long i = 0; i < _1B; i++) {
                total += i;
            }
            System.out.println("total=" + total);

            long elapsedMs = System.currentTimeMillis() - startMs;
            System.out.println("Elapsed time: " + elapsedMs + " ms");
        }
    }
Here’s the multithreaded version:
    public final class MultithreadedCompute {
        private static final long _1B = 1000000000L; // one billion
        private static final long _100M = _1B / 10L;

        public static void main(String[] args) {
            long startMs = System.currentTimeMillis();

            // Each worker sums one tenth of the range [0, _1B).
            System.out.println("Creating workers");
            Worker[] workers = new Worker[10];
            for (int i = 0; i < 10; i++) {
                workers[i] = new Worker(i * _100M, (i + 1) * _100M);
            }

            System.out.println("Starting workers");
            for (int i = 0; i < 10; i++) {
                workers[i].start();
            }

            for (int i = 0; i < 10; i++) {
                try {
                    workers[i].join();
                    System.out.println("Joined with thread " + i);
                } catch (InterruptedException e) { /* can't happen */ }
            }

            System.out.println("Summing worker totals");
            long total = 0;
            for (int i = 0; i < 10; i++) {
                total += workers[i].getTotal();
            }
            System.out.println("total=" + total);

            long elapsedMs = System.currentTimeMillis() - startMs;
            System.out.println("Elapsed time: " + elapsedMs + " ms");
        }

        private static class Worker extends Thread {
            private long start, end;
            private long total;

            public Worker(long start, long end) {
                this.start = start;
                this.end = end;
            }

            public void run() {
                System.out.println("Computing sum " + start + " + ... + (" + end + " - 1)");
                for (long i = start; i < end; i++) {
                    total += i;
                }
            }

            public long getTotal() {
                return total;
            }
        }
    }
Here’s the output from running the single-threaded version:
    total=499999999500000000
    Elapsed time: 7031 ms
And here’s the output from running the multithreaded version:
    Creating workers
    Starting workers
    Computing sum 0 + ... + (100000000 - 1)
    Computing sum 100000000 + ... + (200000000 - 1)
    Computing sum 200000000 + ... + (300000000 - 1)
    Computing sum 300000000 + ... + (400000000 - 1)
    Computing sum 400000000 + ... + (500000000 - 1)
    Computing sum 500000000 + ... + (600000000 - 1)
    Computing sum 600000000 + ... + (700000000 - 1)
    Computing sum 700000000 + ... + (800000000 - 1)
    Computing sum 800000000 + ... + (900000000 - 1)
    Computing sum 900000000 + ... + (1000000000 - 1)
    Joined with thread 0
    Joined with thread 1
    Joined with thread 2
    Joined with thread 3
    Joined with thread 4
    Joined with thread 5
    Joined with thread 6
    Joined with thread 7
    Joined with thread 8
    Joined with thread 9
    Summing worker totals
    total=499999999500000000
    Elapsed time: 6172 ms
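As a sanity check that both versions really do the same work, the printed total can be compared against the closed form for 0 + 1 + ... + (n - 1), which is n(n - 1)/2. This is only a sketch; the class and method names (`ExpectedTotal`, `sumBelow`) are made up for illustration and are not part of the two classes above.

```java
final class ExpectedTotal {
    // Closed form for 0 + 1 + ... + (n - 1), i.e. n * (n - 1) / 2.
    static long sumBelow(long n) {
        // Halve whichever factor is even before multiplying,
        // so the intermediate product stays as small as possible.
        return (n % 2 == 0) ? (n / 2) * (n - 1) : n * ((n - 1) / 2);
    }

    public static void main(String[] args) {
        // One billion, the same bound both classes use.
        System.out.println("expected total=" + sumBelow(1000000000L));
        // prints: expected total=499999999500000000
    }
}
```

Both runs above print this same total, so the timing difference is not caused by one version skipping part of the range.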
EDIT: Information on the environment:
- Microsoft Windows XP Professional Version 2002, SP3
- Dell Precision 670
- Intel Xeon CPU 2.80GHz, 1 MB L2 cache
Not sure how to prove it’s a single core machine other than by stating the spec above and by noting that back when I bought the machine (Aug 2005), single cores were the standard and I didn’t upgrade to multicore (if that was even an option… I don’t remember). If there’s somewhere in Windows I can check other than System Properties (which shows the info above) let me know and I’ll check.
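One check that doesn't depend on Windows dialogs is to ask the JVM itself how many processors it sees. A minimal sketch (the class name `CoreCount` is just for illustration): `Runtime.availableProcessors()` reports the processors available to the JVM, which are logical processors, so a value above 1 would not by itself prove a second physical core.

```java
final class CoreCount {
    // Number of processors the JVM reports as available to it.
    // This counts logical processors, not physical cores.
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors=" + logicalProcessors());
    }
}
```

If this prints 1, the machine looks single-core to the JVM; if it prints more, the single-core assumption behind the experiment is worth re-checking.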
Here are five consecutive ST and MT runs:
FIVE SINGLE-THREADED RUNS:
total=499999999500000000 Elapsed time: 7000 ms
total=499999999500000000 Elapsed time: 7031 ms
total=499999999500000000 Elapsed time: 6922 ms
total=499999999500000000 Elapsed time: 6968 ms
total=499999999500000000 Elapsed time: 6938 ms
FIVE MULTITHREADED RUNS:
total=499999999500000000 Elapsed time: 6047 ms
total=499999999500000000 Elapsed time: 6141 ms
total=499999999500000000 Elapsed time: 6063 ms
total=499999999500000000 Elapsed time: 6282 ms
total=499999999500000000 Elapsed time: 6125 ms
I tried turning off the JIT as Pax suggested in the comment above. Pax, if you want to post a quick ‘Turn off the JIT’ answer I’ll credit your solution.
Anyway, turning off the JIT worked, meaning that it brought the actual results in line with the expected results. I had to back away from one billion as it was taking forever, so I went with 100 million instead. With the JIT off, the multithreaded version is now the slower one, by roughly a second and a half per run. Here are the results:
FIVE NO-JIT SINGLE-THREADED RUNS
total=4999999950000000 Elapsed time: 17094 ms
total=4999999950000000 Elapsed time: 17109 ms
total=4999999950000000 Elapsed time: 17219 ms
total=4999999950000000 Elapsed time: 17375 ms
total=4999999950000000 Elapsed time: 17125 ms
FIVE NO-JIT MULTITHREADED RUNS
total=4999999950000000 Elapsed time: 18719 ms
total=4999999950000000 Elapsed time: 18750 ms
total=4999999950000000 Elapsed time: 18610 ms
total=4999999950000000 Elapsed time: 18890 ms
total=4999999950000000 Elapsed time: 18719 ms
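For anyone repeating the no-JIT runs, it helps to confirm which mode the JVM is actually in before trusting the timings. A minimal sketch, assuming a HotSpot JVM, where the `java.vm.info` system property typically contains "mixed mode" by default and "interpreted mode" when the JVM is started with `-Xint`. The class name `JitMode` is hypothetical, and other JVMs may report something different or leave the property unset.

```java
final class JitMode {
    // Returns the VM's self-description of its execution mode,
    // or "unknown" if the property is not set on this JVM.
    static String vmInfo() {
        return System.getProperty("java.vm.info", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("java.vm.name=" + System.getProperty("java.vm.name", "unknown"));
        System.out.println("java.vm.info=" + vmInfo());
    }
}
```

Running it once as `java JitMode` and once as `java -Xint JitMode` should show whether the flag took effect on a given JVM.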
Thanks guys for the ideas and help.