I'm writing a program to learn MPI. As a first step, I wrote a function that multiplies square matrices.
long **multiplyMatrices(long **matrix1, long **matrix2, long capacity)
{
    long **resultMatrix = new long*[capacity];
    for (long i = 0; i < capacity; ++i) {
        resultMatrix[i] = new long[capacity];
    }
    for (long i = 0, j, k; i < capacity; ++i) {
        for (j = 0; j < capacity; ++j) {
            resultMatrix[i][j] = 0;
            for (k = 0; k < capacity; ++k) {
                resultMatrix[i][j] = resultMatrix[i][j] + matrix1[i][k] * matrix2[k][j];
            }
        }
    }
    return resultMatrix;
}
Here capacity == 1000.
On localhost (Mac Mini 2012, Core i7, OS X 10.8.2) I compile this code in Xcode with LLVM. The calculation takes 17 seconds, in a single thread.
On the remote host (SunOS 5.11, dual-core CPU, 8 vCPU) I compile it with
g++ -I/usr/openmpi/ompi-1.5/include -I/usr/openmpi/ompi-1.5/include/openmpi -O2 main.cpp -R/opt/mx/lib -R/usr/openmpi/ompi-1.5/lib -L/usr/openmpi/ompi-1.5/lib -lmpi -lopen-rte -lopen-pal -lnsl -lrt -lm -ldl -lsocket -o main
or just
g++ -O2 main.cpp -o main
But mpirun main takes 152 seconds for the same calculation. What's wrong? Am I missing something? Is it something about the server's CPU architecture?
The main cause is memory management.
Look at the lines that allocate the matrix: every row is created by its own new long[capacity] call.
As a result, the rows end up in different places in memory rather than in one contiguous block. We know how physical memory is laid out on the Mac Mini (two memory modules), but on the server it may even be spread across different hosts (a cluster).
Now we’ll try to fix this.
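One way to do this, as a minimal sketch: allocate a single block of capacity * capacity elements and let the row pointers point into it. The helper names allocateContiguous and freeContiguous are hypothetical, and the sketch assumes capacity > 0.

```cpp
// Hypothetical helper: allocates a capacity x capacity matrix whose rows
// all live in one contiguous block of memory. Assumes capacity > 0.
long **allocateContiguous(long capacity)
{
    long **matrix = new long*[capacity];
    // A single allocation holds every element; () zero-initializes them.
    matrix[0] = new long[capacity * capacity]();
    // Each row pointer is just an offset into that one block.
    for (long i = 1; i < capacity; ++i) {
        matrix[i] = matrix[0] + i * capacity;
    }
    return matrix;
}

// Hypothetical helper: releases a matrix created by allocateContiguous.
void freeContiguous(long **matrix)
{
    delete[] matrix[0];  // the element block
    delete[] matrix;     // the row pointers
}
```

The matrix[i][j] indexing in multiplyMatrices keeps working unchanged; only the per-row allocation loop is replaced by a call to allocateContiguous(capacity).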
This brings the running time down to 9.8 seconds on the Mac Mini and to 58 seconds on the server.
But I still don't know where the rest of the time goes. Maybe I should optimize how the loops walk through one of the matrices.
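One possible direction, sketched here and not measured: in the original function the innermost loop changes k in matrix2[k][j], so every step jumps to a different row of matrix2. Swapping the two inner loops makes the innermost loop walk along a single row of both matrix2 and the result. The function name multiplyMatricesIKJ is hypothetical.

```cpp
// Hypothetical variant of multiplyMatrices with the j and k loops swapped,
// so the innermost loop reads matrix2 and writes the result row by row.
long **multiplyMatricesIKJ(long **matrix1, long **matrix2, long capacity)
{
    long **resultMatrix = new long*[capacity];
    for (long i = 0; i < capacity; ++i) {
        // () zero-initializes the row, because sums are accumulated in place.
        resultMatrix[i] = new long[capacity]();
    }
    for (long i = 0; i < capacity; ++i) {
        for (long k = 0; k < capacity; ++k) {
            const long factor = matrix1[i][k];  // constant for the whole inner loop
            for (long j = 0; j < capacity; ++j) {
                resultMatrix[i][j] += factor * matrix2[k][j];
            }
        }
    }
    return resultMatrix;
}
```

The result is the same as before because each resultMatrix[i][j] still receives the same terms, only in a different order. Whether this helps, and by how much, depends on the machine and compiler, so it needs to be timed on both hosts.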