Is it possible to get 3-6x speedup from the following simple class?
I am trying to make a class that pretends to be an inline function, but overloading the parenthesis operator through subsref is not fast enough for me.
I created the class CTestOp to replace the inline function f = @(x) A*x by letting subsref take a vector and multiply it by the class property A.
Benchmarks indicate that for small A and x (say, m=5), calling the inline function takes 4-7x as long as just writing A*x, and calling the class takes 4-7x as long as calling the inline function:
Elapsed time is 0.327328 seconds for the class
Elapsed time is 0.053322 seconds for the inline function.
Elapsed time is 0.011704 seconds for just writing A*x.
I have made a series of improvements to get here but there are problems. I can see substantial gains, for instance, by not asking for this.A but then that defeats the whole purpose. I would have liked to use an abstract class that allows us to write various operation functions—but while making the class abstract didn’t add much time at all, making the actual function call did.
Any ideas?
The class is:
classdef CTestOp < handle
    properties
        A = [];
    end
    methods
        function this = CTestOp(A)
            this.A = A;
        end
        function result = operation(this, x)
            result = this.A*x;
        end
        function result = subsref(this, S)
            % switch S.type
            %     case '()'
            %         result = this.operation(S.subs{1}); % Killed because this was really slow
            %         result = operation(this, S.subs{1}); % I wanted this, but it was too slow
            result = this.A*S.subs{1};
            %     otherwise
            %         result = builtin('subsref', this, S);
            % end
        end
    end
end
While the test code is:
m = 5;
A = randn(m,m);
x = randn(m,1);
f = @(x) A*x;
myOp = CTestOp(A);
nc = 10000;
% Try with the class:
tic
for ind = 1:nc
    r_abs = myOp(x);
end
toc
% Try with the inline function:
tic
for ind = 1:nc
    r_fp = f(x);
end
toc
% Try just inline. so fast!
tic
for ind = 1:nc
    r_inline = A*x;
end
toc
If you want to write fast code in Matlab, the trick has always been to vectorize it.
The same holds for Matlab OO. Though I am unable to test it at the moment, I am quite confident that you can reduce the overhead by performing one big operation rather than many small ones.
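As a rough illustration of "one big operation rather than many small ones", here is a minimal sketch that does not use the class at all. The matrix X is a hypothetical name, not something from the question; it assumes the nc input vectors are available up front and can be stored as columns. No timings are quoted, since they depend on the machine and Matlab version.

```matlab
% Sketch: nc separate products versus one batched product.
m  = 5;
nc = 10000;
A  = randn(m, m);
X  = randn(m, nc);   % hypothetical: the nc input vectors stored as columns

% Many small operations: one matrix-vector product per column.
R_loop = zeros(m, nc);
tic
for ind = 1:nc
    R_loop(:, ind) = A * X(:, ind);
end
toc

% One big operation: a single matrix-matrix product.
tic
R_batch = A * X;
toc

% The two results should agree up to rounding error.
max(abs(R_loop(:) - R_batch(:)))
```

Because subsref in the question's class simply forms this.A*S.subs{1}, passing the whole matrix as myOp(X) should work the same way, so the per-call overhead would be paid once instead of nc times.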
In your specific example, you can run the benchmark again and see if my statement actually holds by changing these two lines: