I have some classes implementing some computations which I have
to optimize for different SIMD implementations, e.g. Altivec and
SSE. I don’t want to pollute the code with #ifdef ... #endif blocks
for each method I have to optimize, so I tried a couple of other
approaches, but unfortunately I’m not very satisfied with how it turned
out for reasons I’ll try to clarify. So I’m looking for some advice
on how I could improve what I have already done.
1. Different implementation files with crude includes
I have the same header file describing the class interface with different
“pseudo” implementation files for plain C++, Altivec and SSE only for the
relevant methods:
// Algo.h
#ifndef ALGO_H_INCLUDED_
#define ALGO_H_INCLUDED_
class Algo
{
public:
Algo();
~Algo();
void process();
protected:
void computeSome();
void computeMore();
};
#endif
// Algo.cpp
#include "Algo.h"
Algo::Algo() { }
Algo::~Algo() { }
void Algo::process()
{
computeSome();
computeMore();
}
#if defined(ALTIVEC)
#include "Algo_Altivec.cpp"
#elif defined(SSE)
#include "Algo_SSE.cpp"
#else
#include "Algo_Scalar.cpp"
#endif
// Algo_Altivec.cpp
void Algo::computeSome()
{
}
void Algo::computeMore()
{
}
... same for the other implementation files
Pros:
- the split is quite straightforward and easy to do
- there is no “overhead” (don’t know how to say it better) to objects of my class,
by which I mean no extra inheritance, no addition of member variables etc.
- much cleaner than #ifdef-ing all over the place
Cons:
- I have three additional files to maintain; I could put the Scalar
implementation in the Algo.cpp file and end up with just two, but the
inclusion part will look and feel a bit dirtier
- they are not compilable units per se and have to be excluded from the
project structure
- if I do not yet have the specific optimized implementation for, let’s say,
SSE, I would have to duplicate some code from the plain (Scalar) C++ implementation file
- I cannot fall back to the plain C++ implementation if needed; is it even possible
to do that in the described scenario?
- I do not feel any structural cohesion in the approach
2. Different implementation files with private inheritance
// Algo.h
class Algo : private AlgoImpl
{
... as before
};
// AlgoImpl.h
#ifndef ALGOIMPL_H_INCLUDED_
#define ALGOIMPL_H_INCLUDED_
class AlgoImpl
{
protected:
AlgoImpl();
~AlgoImpl();
void computeSomeImpl();
void computeMoreImpl();
};
#endif
// Algo.cpp
...
void Algo::computeSome()
{
computeSomeImpl();
}
void Algo::computeMore()
{
computeMoreImpl();
}
// Algo_SSE.cpp
AlgoImpl::AlgoImpl()
{
}
AlgoImpl::~AlgoImpl()
{
}
void AlgoImpl::computeSomeImpl()
{
}
void AlgoImpl::computeMoreImpl()
{
}
Pros:
- the split is quite straightforward and easy to do
- much cleaner than #ifdef-ing all over the place
- still there is no “overhead” to my class: EBCO should kick in
- the semantics of the class are much cleaner, at least compared to the above,
that is, private inheritance == “is implemented in terms of”
- the different files are compilable, can be included in the project
and selected via the build system
Cons:
- I have three additional files for maintenance
- if I do not yet have the specific optimized implementation for, let’s say,
SSE, I would have to duplicate some code from the plain (Scalar) C++ implementation file
- I cannot fall back to the plain C++ implementation if needed
3. Is basically method 2 but with virtual functions in the AlgoImpl class. That
would allow me to avoid duplicating the plain C++ code if needed,
by providing an empty implementation in the base class and overriding it in the derived one,
although I will have to disable that behavior when I actually implement the optimized
version. Also, the virtual functions will bring some “overhead” to objects of my class.
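To make variant 3 concrete, here is a minimal sketch. It is only an illustration under a few assumptions: the class AlgoImplSSE is hypothetical, the base class holds the plain C++ code instead of an empty body, Algo holds a reference to the implementation instead of inheriting from it, and the methods return strings purely so that the chosen path is observable.

```cpp
#include <string>

// Base class holds the plain C++ versions as virtual functions
// (an assumption of this sketch; the text above suggests empty bodies).
class AlgoImpl {
public:
    virtual ~AlgoImpl() {}
    virtual std::string computeSomeImpl() { return "scalar some"; }
    virtual std::string computeMoreImpl() { return "scalar more"; }
};

// Hypothetical SSE variant: only computeSomeImpl is optimized so far.
class AlgoImplSSE : public AlgoImpl {
public:
    virtual std::string computeSomeImpl() { return "sse some"; }
    // computeMoreImpl is not overridden, so the base version is used.
};

class Algo {
public:
    explicit Algo(AlgoImpl& impl) : impl_(impl) {}
    std::string process() {
        return impl_.computeSomeImpl() + ", " + impl_.computeMoreImpl();
    }
private:
    AlgoImpl& impl_;  // one reference per object, plus a vtable pointer in the impl
};
```

With this shape, an unfinished optimized class falls back method by method without duplicating code, at the price of an indirect call and the extra pointer per object.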
4. A form of tag dispatching via enable_if<>
Pros:
- the split is quite straightforward and easy to do
- much cleaner than #ifdef-ing all over the place
- still there is no “overhead” to my class
- will eliminate the need for different files for different implementations
Cons:
- templates will be a bit more “cryptic” and seem to bring an unnecessary
overhead (at least for some people in some contexts)
- if I do not yet have the specific optimized implementation for, let’s say,
SSE, I would have to duplicate some code from the plain (Scalar) C++ implementation
- I cannot fall back to the plain C++ implementation if needed
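For illustration, variant 4 might look roughly like the sketch below. Note that it uses plain overloading on tag types rather than enable_if<>, which is a substitution on my part: because sse_tag derives from scalar_tag, a call with sse_tag picks the scalar overload whenever no SSE overload exists. The tag names, best_tag and process_with() are hypothetical, and the methods return strings only so the selected path can be seen.

```cpp
#include <string>

struct scalar_tag {};
struct sse_tag : scalar_tag {};  // derives, so it converts to scalar_tag

// Pick the default tag at compile time; __SSE2__ is set by GCC and Clang.
#if defined(__SSE2__)
typedef sse_tag best_tag;
#else
typedef scalar_tag best_tag;
#endif

class Algo {
public:
    // Tag is explicit here so both paths can be exercised on any machine.
    template <class Tag>
    std::string process_with() {
        return computeSome(Tag()) + ", " + computeMore(Tag());
    }
    std::string process() { return process_with<best_tag>(); }

private:
    std::string computeSome(scalar_tag) { return "scalar some"; }
    std::string computeSome(sse_tag)    { return "sse some"; }    // exact match wins
    std::string computeMore(scalar_tag) { return "scalar more"; }
    // No sse_tag overload of computeMore yet: overload resolution
    // converts sse_tag to its base and calls the scalar version.
};
```

The fallback here is resolved entirely at compile time, so objects of the class carry no extra data.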
What I couldn’t figure out yet for any of the variants is how to properly and
cleanly fall back to the plain C++ implementation.
Also I don’t want to over-engineer things and in that respect the first variant
seems the most “KISS” like even considering the disadvantages.
As requested in the comments, here’s a summary of what I did:
Set up policy_list helper template utility
This maintains a list of policies, and gives them a “runtime check” call before calling the first suitable implementation.
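A minimal sketch of such a helper follows. It is not the original code: it assumes C++11 variadic templates, and the two policies never_available and plain_cpp are hypothetical stand-ins that exist only to show the ordering. The names policy_list, runtime_check(), impl() and apply() follow the description in this answer.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

template <class... Policies>
struct policy_list;

// End of the list: nothing passed its runtime check.
template <>
struct policy_list<> {
    template <class... Args>
    static void apply(Args&&...) {
        throw std::logic_error("no usable policy");
    }
};

// Try the first policy; otherwise recurse into the rest of the list.
template <class First, class... Rest>
struct policy_list<First, Rest...> {
    template <class... Args>
    static void apply(Args&&... args) {
        if (First::runtime_check()) {
            First::impl(std::forward<Args>(args)...);
        } else {
            policy_list<Rest...>::apply(std::forward<Args>(args)...);
        }
    }
};

// Hypothetical stand-in policies, used only to demonstrate the ordering.
struct never_available {
    static bool runtime_check() { return false; }
    static void impl(std::string& out) { out = "never_available"; }
};
struct plain_cpp {
    static bool runtime_check() { return true; }
    static void impl(std::string& out) { out = "plain_cpp"; }
};
```

Calling policy_list<never_available, plain_cpp>::apply(name) skips the first policy and runs the second; ending every real list with an always-usable plain C++ policy keeps the throwing case unreachable.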
Set up specific policies
These policies implement both a runtime test and an actual implementation of the algorithm in question. For my actual problem impl took another template parameter that specified what exactly it was they were implementing; here, though, the example assumes there is only one thing to be implemented. The runtime tests are cached in a static bool; for some (e.g. the Altivec one I used) the test was really slow. For others (e.g. the OpenCL one) the test is actually “is this function pointer NULL?” after one attempt at setting it with dlsym().
Set per architecture policy_list
Trivial example sets one of two possible lists based on the ARCH_HAS_SSE preprocessor macro. You might generate this from your build script, or use a series of typedefs, or hack support for “holes” in the policy_list that might be void on some architectures, skipping straight to the next one without trying to check for support. GCC sets some preprocessor macros for you that might help, e.g. __SSE2__.
You can use this to compile multiple variants on the same platform too, e.g. an SSE and a no-SSE binary on x86.
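The two steps above can be sketched as follows. Everything beyond policy_list, runtime_check(), impl() and ARCH_HAS_SSE is an assumption made for illustration: the policy names, the slow_sse_probe() placeholder (a real probe would query the CPU or OS) and the sse_probe_calls counter, which exists only to make the caching visible. The list helper is merely forward-declared here.

```cpp
#include <string>

// The list helper is assumed to be defined elsewhere.
template <class... Policies>
struct policy_list;

int sse_probe_calls = 0;  // demo-only counter showing how often the probe runs

// Placeholder for an expensive capability test; always reports "no SSE" here.
inline bool slow_sse_probe() {
    ++sse_probe_calls;
    return false;
}

struct sse_policy {
    static bool runtime_check() {
        static const bool ok = slow_sse_probe();  // evaluated once, then cached
        return ok;
    }
    static void impl(std::string& out) { out = "sse"; }
};

struct scalar_policy {
    static bool runtime_check() { return true; }  // plain C++ always works
    static void impl(std::string& out) { out = "scalar"; }
};

// One list per architecture, chosen by the build.
#if defined(ARCH_HAS_SSE)
typedef policy_list<sse_policy, scalar_policy> algo_policies;
#else
typedef policy_list<scalar_policy> algo_policies;
#endif
```

On a build without ARCH_HAS_SSE the SSE policy never appears in the list, so no code is generated for it.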
Use the policy list
Fairly straightforward: call the apply() static method on the policy_list. Trust that it will call the impl() method on the first policy that passes the runtime test.
If you take the “per operation template” approach I mentioned earlier it might be something more like:
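A minimal sketch of that per-operation variant, under stated assumptions: add_op, scale_op, unavailable_policy, scalar_policy and this cut-down policy_list are all made up for illustration rather than taken from the real code, and std::vector<float> stands in for the real Vector type.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical operation tags; each one knows its scalar formula.
struct add_op   { static float eval(float a, float k) { return a + k; } };
struct scale_op { static float eval(float a, float k) { return a * k; } };

// Stand-in for an accelerated policy this machine does not support.
struct unavailable_policy {
    static bool runtime_check() { return false; }
    template <class Op>
    static void impl(std::vector<float>&, float) {}
};

struct scalar_policy {
    static bool runtime_check() { return true; }
    template <class Op>
    static void impl(std::vector<float>& v, float k) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = Op::eval(v[i], k);
    }
};

template <class... Policies>
struct policy_list;

template <>
struct policy_list<> {
    template <class Op>
    static void apply(std::vector<float>&, float) {
        throw std::logic_error("no usable policy");
    }
};

template <class First, class... Rest>
struct policy_list<First, Rest...> {
    // The operation is a template parameter forwarded to the chosen policy.
    template <class Op>
    static void apply(std::vector<float>& v, float k) {
        if (First::runtime_check()) First::template impl<Op>(v, k);
        else policy_list<Rest...>::template apply<Op>(v, k);
    }
};

typedef policy_list<unavailable_policy, scalar_policy> policies;
```

A call then reads policies::apply<scale_op>(v, 4.0f), with the operation named at the call site and the implementation chosen by the list.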
In that case you end up making your Matrix and Vector types aware of the policy_list in order that they can decide how/where to store the data. You can also use heuristics for this, e.g. “small vector/matrix lives in main memory no matter what”, and make the runtime_check() or another function test the appropriateness of a particular approach to a given implementation for a specific instance.
I also had a custom allocator for containers, which always produced suitably aligned memory on any SSE/Altivec enabled build, regardless of whether the specific machine had support for Altivec. It was just easier that way, although it could be a typedef in a given policy and you always assume that the highest priority policy has the strictest allocator needs.
Example have_altivec():
I’ve included a sample have_altivec() implementation for completeness, simply because it’s the shortest and therefore most appropriate for posting here. The x86/x86_64 CPUID one is messy because you have to support the compiler specific ways of writing inline ASM. The OpenCL one is messy because we check some of the implementation limits and extensions too.
Conclusion
Basically you pay no penalty for platforms that can never support an implementation (the compiler generates no code for them) and only a small penalty (potentially just a test/jmp pair that the CPU predicts very well, if your compiler is half-decent at optimising) for platforms that could support something but don’t. You pay no extra cost for platforms that the first choice implementation runs on. The details of the runtime tests vary with the technology in question.