Typically, a BLAS subroutine is defined for a certain unique operation. For instance,
DAXPY is necessarily y <-- ax + y,
DSCAL is necessarily x <-- ax.
What I wish to achieve is:
z = ax+by and y = ax.
How do I “extend” the subroutines of BLAS so that I can do the above?
(These operations do not necessarily follow each other)
I have tried:
- Declaring a dummy and then DCOPYing the dummy to the desired vector. Like, DCOPY(dummy,x); DSCAL(a,dummy); DCOPY(y,dummy)
- Creating my own OpenMP implementation
- Using tricks like DCOPY(y,a*x) for y=ax
But the problem is that none of these methods gives me a conclusive answer as to which is the best way around this problem. I know I should “Profile, Profile, Profile” rather than ask, and I have tried that, but every time I change the vector a little, what was the best method earlier suddenly becomes the worst, or vice versa.
Also,
- My intention is to bring about the best performance possible.
- I know that optimizing these operations probably won’t give me much of a performance boost, but I’m trying to save every picosecond that I can.
- FWIW, I am linking to Intel MKL
First of all, for y <- a x you can drop one redundant copy by using DCOPY(y,x); DSCAL(a,y).
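To make the two passes over memory visible, here is a minimal sketch in plain C. The helpers `ref_copy`, `ref_scal` and `scale_two_calls` are hypothetical unit-stride stand-ins, not the MKL routines; they only mirror what DCOPY and DSCAL do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for DCOPY with unit stride: dst <- src. */
void ref_copy(size_t n, const double *src, double *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Hypothetical stand-in for DSCAL with unit stride: v <- a * v. */
void ref_scal(size_t n, double a, double *v)
{
    for (size_t i = 0; i < n; ++i)
        v[i] *= a;
}

/* y <- a * x as two calls: copy x into y, then scale y in place.
   x is left untouched; y must not overlap x. */
void scale_two_calls(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    ref_copy(n, x, y);   /* pass 1: read x, write y */
    ref_scal(n, a, y);   /* pass 2: read y, write y */
}
```

Compared with the dummy-vector version, this saves one full copy and the extra buffer, but y is still traversed twice.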
Second, OpenMP is IMHO not a solution for this kind of problem, because these operations are “memory bound”. What matters is overlapping memory accesses with computation, and vectorization, which makes better use of bandwidth through vector memory accesses. Hand-tuned code tends to be very complex because it has to account for branch prediction, cache policies, register file configuration, etc. You need something like the ATLAS library by R. Clint Whaley, which automatically generates optimized implementations of an operation for a particular platform. AFAIK, there is a BLAST standard (2001); maybe you’ll find variants similar to the operations you’ve presented there. You may need to e-mail the ATLAS developers to ask them to add these operations to their autotuner.
As a starting point, I would recommend the following implementation of z = ax+by.
Since z is overwritten anyway, and provided x and y are read-only, you could use:
DCOPY(z,y); DSCAL(b,z); DAXPY(a, x, z);
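As a rough illustration of why the number of passes matters for a memory-bound operation, the sketch below writes the three-call sequence as plain loops next to a single fused loop. Both function names are made up for this example, strides are assumed to be 1, and the loops stand in for the library calls rather than reproduce them.

```c
#include <assert.h>
#include <stddef.h>

/* z <- a*x + b*y as three passes, mirroring DCOPY, DSCAL, DAXPY
   with unit stride. Plain loops stand in for the library calls. */
void axpby_three_passes(size_t n, double a, const double *x,
                        double b, const double *y, double *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 0; i < n; ++i) z[i] = y[i];        /* DCOPY(z, y)    */
    for (size_t i = 0; i < n; ++i) z[i] *= b;          /* DSCAL(b, z)    */
    for (size_t i = 0; i < n; ++i) z[i] += a * x[i];   /* DAXPY(a, x, z) */
}

/* Same operation in one pass: x and y are each read once and z is
   written once. Results agree with the version above up to rounding. */
void axpby_fused(size_t n, double a, const double *x,
                 double b, const double *y, double *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 0; i < n; ++i)
        z[i] = a * x[i] + b * y[i];
}
```

Whether the fused loop actually beats three tuned library calls depends on the vector length and the platform, so it still has to be measured. MKL also documents a BLAS-like extension, `cblas_daxpby`, which computes y <- a*x + b*y in place; if your version provides it, DCOPY(z,y) followed by that call would be two passes instead of three. Check your MKL documentation before relying on it.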
You could also read the articles about the ATLAS project, which cover the key considerations in code optimization (whether a multiply-add (madd) instruction is available, cache characteristics, register file configuration, instruction latencies, etc.), and try to write something like a code generator for your operations that pipelines the execution of the various operations and searches among the candidate variants.
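The "search among variants" idea can be sketched in miniature: time each candidate on the actual vector length and keep the fastest. This is only an illustration of the principle, not how ATLAS is implemented; `axpby_fn`, `pick_fastest` and the two candidate functions are hypothetical names. It uses `clock()`, which measures processor time, so with a threaded library a wall-clock timer would be a better fit.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Common signature for every candidate implementation of z <- a*x + b*y. */
typedef void (*axpby_fn)(size_t n, double a, const double *x,
                         double b, const double *y, double *z);

/* Candidate 1: three passes (copy, scale, axpy), unit stride. */
void axpby_three_passes(size_t n, double a, const double *x,
                        double b, const double *y, double *z)
{
    for (size_t i = 0; i < n; ++i) z[i] = y[i];
    for (size_t i = 0; i < n; ++i) z[i] *= b;
    for (size_t i = 0; i < n; ++i) z[i] += a * x[i];
}

/* Candidate 2: one fused pass. */
void axpby_fused(size_t n, double a, const double *x,
                 double b, const double *y, double *z)
{
    for (size_t i = 0; i < n; ++i)
        z[i] = a * x[i] + b * y[i];
}

/* Run every candidate `repeats` times on the caller's data and
   return the index of the one with the smallest total time. */
size_t pick_fastest(const axpby_fn *variants, size_t count, int repeats,
                    size_t n, double a, const double *x,
                    double b, const double *y, double *z)
{
    assert(variants != NULL && count > 0 && repeats > 0);
    size_t best = 0;
    double best_time = -1.0;
    for (size_t v = 0; v < count; ++v) {
        variants[v](n, a, x, b, y, z);            /* warm-up, not timed */
        clock_t start = clock();
        for (int r = 0; r < repeats; ++r)
            variants[v](n, a, x, b, y, z);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = v;
        }
    }
    return best;
}
```

Running the selection once per vector length (and caching the answer) is one way to cope with the best method changing as the vector changes.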
It’s an interesting topic; I’ve been implementing BLAS on heterogeneous multicore architectures with explicitly managed memory hierarchies, like the Cell processor. I wish you good luck! Hope my answer is useful!