I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working Boost code):
Result result = read ();
boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);
and another one with the standard algorithm (see full standard code):
vector< vector<int> > ijkalgorithm(vector< vector<int> > A,
                                   vector< vector<int> > B) {
    int n = A.size();
    // initialise C with 0s
    vector<int> tmp(n, 0);
    vector< vector<int> > C(n, tmp);
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}
This is how I test the speed:
time boostImplementation.out > boostResult.txt
diff boostResult.txt correctResult.txt
time simpleImplementation.out > simpleResult.txt
diff simpleResult.txt correctResult.txt
Both programs read a hard-coded textfile which contains two 2000 x 2000 matrices.
Both programs were compiled with these flags:
g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic
I got 15 seconds for my implementation and over 4 minutes for the Boost implementation!
edit: After compiling it with
g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out
I got 28.19 seconds for the ikj-algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.
Why is Boost so much slower than my implementation?
The slower performance of the uBLAS version is partly explained by its debugging features, as TJD pointed out.
Here’s the time taken by the uBLAS version with debugging on:
Here’s the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):
So with debugging off, the uBLAS version is almost 3 times faster.
The remaining performance difference can be explained by the following section of the uBLAS FAQ, “Why is uBLAS so much slower than (atlas-)BLAS”:
This generality almost always comes with a cost. In particular the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately uBLAS provides alternatives optimized for dense matrix multiplication, in particular, axpy_prod and block_prod. Here are the results of comparing different methods:
As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and carefully choosing the block size for block_prod (I used 64) can make the difference more pronounced.
See also uBLAS FAQ and Effective uBlas and general code optimization.