I am trying to benchmark how fast Java can do a simple task: read a huge file into memory and then perform some meaningless calculations on the data. All types of optimizations count, whether that means rewriting the code, using a different JVM, or tricking the JIT.
The input file is a list of 500 million pairs of 32-bit integers, one pair per line, with the two values separated by a comma. Like this:
44439,5023
33140,22257
…
This file takes 5.5GB on my machine. The program can’t use more than 8GB of RAM and can use only a single thread.
package speedracer;

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class Main
{
    public static void main(String[] args)
    {
        // 500 million pairs = 1 billion ints (about 4GB of heap).
        int[] list = new int[1000000000];
        long start1 = System.nanoTime();
        parse(list);
        long end1 = System.nanoTime();
        System.out.println("Parsing took: " + (end1 - start1) / 1000000000.0);
        int rs = 0;
        long start2 = System.nanoTime();
        // Each iteration consumes four consecutive values (two pairs):
        // three k++ in the argument list plus the loop's own k++.
        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }
        long end2 = System.nanoTime();
        System.out.println(rs);
        System.out.println("Calculations took: " + (end2 - start2) / 1000000000.0);
    }

    public static int calc(final int a1, final int a2, final int b1, final int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    public static void parse(int[] list)
    {
        FileChannel fc = null;
        int i = 0;
        MappedByteBuffer byteBuffer;
        try {
            fc = new FileInputStream("in.txt").getChannel();
            long size = fc.size();
            long allocated = 0;
            long allocate = 0;
            // Declared outside the chunk loop so that a number straddling
            // a mapping boundary is not split into two entries.
            int number = 0;
            while (size > allocated) {
                // A single mapping is limited to Integer.MAX_VALUE bytes.
                if ((size - allocated) > Integer.MAX_VALUE) {
                    allocate = Integer.MAX_VALUE;
                } else {
                    allocate = size - allocated;
                }
                byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, allocated, allocate);
                byteBuffer.clear();
                allocated += allocate;
                while (byteBuffer.hasRemaining()) {
                    char val = (char) byteBuffer.get();
                    if (val == '\n' || val == ',') {
                        list[i] = number;
                        number = 0;
                        i++;
                    } else {
                        number = number * 10 + (val - '0');
                    }
                }
            }
            fc.close();
        } catch (Exception e) {
            System.err.println("Parsing error: " + e);
        }
    }
}
I’ve tried everything I could think of: different readers, and openjdk6, sunjdk6 and sunjdk7. I had to do some ugly parsing since MappedByteBuffer cannot map more than 2GB of memory at once. I’m running:
Linux AS292 2.6.38-11-generic #48-Ubuntu SMP Fri Jul 29 19:02:55 UTC 2011 x86_64 GNU/Linux (Ubuntu 11.04)
CPU: Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz
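Because a single mapping cannot exceed 2GB, the file has to be parsed in chunks, and the parser state has to carry over from one chunk to the next; otherwise a number that straddles a chunk boundary would be split in two. A minimal sketch of that idea, using tiny in-memory chunks in place of real mappings. The class name ChunkedParseSketch, the feed method and the 7-byte chunk size are hypothetical, chosen only for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ChunkedParseSketch
{
    // Parser state that must survive from one chunk to the next.
    private int number = 0;
    private int count = 0;

    // Consume one chunk; completed numbers are appended to 'out'.
    void feed(ByteBuffer chunk, int[] out)
    {
        while (chunk.hasRemaining()) {
            byte b = chunk.get();
            if (b == '\n' || b == ',') {
                out[count++] = number;
                number = 0;
            } else {
                number = number * 10 + (b - '0');
            }
        }
    }

    int count()
    {
        return count;
    }

    public static void main(String[] args)
    {
        byte[] data = "44439,5023\n33140,22257\n".getBytes(StandardCharsets.US_ASCII);
        int[] out = new int[4];
        ChunkedParseSketch parser = new ChunkedParseSketch();
        int chunkSize = 7; // hypothetical stand-in for the 2GB mapping limit
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            parser.feed(ByteBuffer.wrap(data, off, len), out);
        }
        System.out.println(Arrays.toString(out)); // [44439, 5023, 33140, 22257]
    }
}
```

With the two sample lines and 7-byte chunks, every chunk boundary falls inside a number, and the values still come out intact because number is not reset between chunks.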
Currently, my results are 26.50s for parsing and 11.27s for the calculations. I’m competing against a similar C++ benchmark which does the IO in roughly the same time, but its calculations take only 4.5s. My main objective is to reduce the calculation time by any means possible. Any ideas?
Update: It seems the main speed improvement could come from auto-vectorization. I was able to find some hints that Sun’s current JIT only does “some vectorization”, but I can’t really confirm it. It would be great to find a JVM or JIT with better auto-vectorization support.
First of all, -O3 enables (assuming the C++ version was built with GCC) -ftree-vectorize, among others…
So it looks like it actually might be vectorizing.
EDIT:
This has been confirmed (see comments). The C++ version is indeed being vectorized by the compiler. With vectorization disabled, the C++ version actually runs a bit slower than the Java version.
Assuming the JIT does not vectorize the loop, it may be difficult/impossible for the Java version to match the speed of the C++ version.
Now, if I were a smart C/C++ compiler, here’s how I would arrange that loop (on x64):
Note that this loop is completely vectorizable.
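One way to picture such a rearrangement, written in Java for comparison with the question's code. This is only a sketch of the general idea, not necessarily the exact listing the answer had in mind: the single XOR chain is split into four independent accumulators ("lanes") that are combined at the end, which is the shape a vectorizer wants. The names LaneSplitCalc and calcLanes are hypothetical, and nothing here guarantees that a given JIT will actually emit SIMD instructions for it.

```java
final class LaneSplitCalc
{
    // Reference: calc() as posted in the question.
    static int calc(int a1, int a2, int b1, int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // Same result, with the XOR chain split into four independent lanes.
    // XOR is associative and commutative, so the order of combination
    // does not matter; 100 is a multiple of 4, so no remainder loop is needed.
    static int calcLanes(int a1, int a2, int b1, int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        int t0 = c1, t1 = 0, t2 = 0, t3 = 0;
        for (int z = 0; z < 100; z += 4) {
            t0 ^= z + c2;
            t1 ^= (z + 1) + c2;
            t2 ^= (z + 2) + c2;
            t3 ^= (z + 3) + c2;
        }
        return (t0 ^ t1) ^ (t2 ^ t3);
    }

    public static void main(String[] args)
    {
        int expected = calc(44439, 5023, 33140, 22257);
        int actual = calcLanes(44439, 5023, 33140, 22257);
        System.out.println(expected == actual); // prints true
    }
}
```

The four lanes have no data dependency on one another inside the loop, so a compiler can keep them in one 128-bit SIMD register and process four values of z per instruction.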
Even better, I would completely unroll this loop. These are things that a C/C++ compiler will do. But now the question is: will the JIT do it?