I have some classes implementing some computations which I have to optimize for different


Asked: May 26, 2026 (2026-05-26T03:20:18+00:00)


I have some classes implementing some computations which I have
to optimize for different SIMD implementations, e.g. Altivec and
SSE. I don’t want to pollute the code with #ifdef ... #endif blocks
for each method I have to optimize, so I tried a couple of other
approaches, but unfortunately I’m not very satisfied with how they turned
out, for reasons I’ll try to clarify. So I’m looking for some advice
on how I could improve what I have already done.

1. Different implementation files with crude includes

I have the same header file describing the class interface with different
“pseudo” implementation files for plain C++, Altivec and SSE only for the
relevant methods:

// Algo.h
#ifndef ALGO_H_INCLUDED_
#define ALGO_H_INCLUDED_
class Algo
{
public:
    Algo();
    ~Algo();

    void process();
protected:
    void computeSome();
    void computeMore();
};
#endif

// Algo.cpp
#include "Algo.h"
Algo::Algo() { }

Algo::~Algo() { }

void Algo::process()
{
    computeSome();
    computeMore();
}

#if defined(ALTIVEC)
#include "Algo_Altivec.cpp" 
#elif defined(SSE)
#include "Algo_SSE.cpp"
#else
#include "Algo_Scalar.cpp"
#endif

// Algo_Altivec.cpp
void Algo::computeSome()
{
}
void Algo::computeMore()
{
}
... same for the other implementation files

Pros:

- the split is quite straightforward and easy to do
- there is no “overhead” (I don’t know how to say it better) to objects of my class, by which I mean no extra inheritance, no additional member variables, etc.
- much cleaner than #ifdef-ing all over the place

Cons:

- I have three additional files to maintain; I could put the Scalar implementation in the Algo.cpp file and end up with just two, but the inclusion part would look and feel a bit dirtier
- they are not compilable units per se and have to be excluded from the project structure
- if I do not yet have the specific optimized implementation for, let’s say, SSE, I would have to duplicate some code from the plain (Scalar) C++ implementation file
- I cannot fall back to the plain C++ implementation if needed; is it even possible to do that in the described scenario?
- I do not feel any structural cohesion in the approach

2. Different implementation files with private inheritance

// Algo.h
#include "AlgoImpl.h"

class Algo : private AlgoImpl
{
 ... as before
};

// AlgoImpl.h
#ifndef ALGOIMPL_H_INCLUDED_
#define ALGOIMPL_H_INCLUDED_
class AlgoImpl
{
protected:
    AlgoImpl();
    ~AlgoImpl();

   void computeSomeImpl();
   void computeMoreImpl();
};
#endif

// Algo.cpp
...
void Algo::computeSome()
{
    computeSomeImpl();
}
void Algo::computeMore()
{
    computeMoreImpl();
}

// Algo_SSE.cpp
AlgoImpl::AlgoImpl()
{
}
AlgoImpl::~AlgoImpl()
{
}
void AlgoImpl::computeSomeImpl()
{
}
void AlgoImpl::computeMoreImpl()
{
}

Pros:

- the split is quite straightforward and easy to do
- much cleaner than #ifdef-ing all over the place
- there is still no “overhead” to my class, since EBCO should kick in
- the semantics of the class are much cleaner, at least compared to the above: private inheritance == is implemented in terms of
- the different files are compilable, can be included in the project and selected via the build system

Cons:

- I have three additional files to maintain
- if I do not yet have the specific optimized implementation for, let’s say, SSE, I would have to duplicate some code from the plain (Scalar) C++ implementation file
- I cannot fall back to the plain C++ implementation if needed

3. Method 2 with virtual functions in the AlgoImpl class

This is basically method 2 but with virtual functions in the AlgoImpl class. That
would let me avoid duplicating the plain C++ code where needed,
by providing an empty implementation in the base class and overriding it in the derived class,
although I would have to disable that behavior when I actually implement the optimized
version. Also, the virtual functions will bring some “overhead” to objects of my class.
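To make variant 3 concrete, here is a minimal sketch of one possible reading of it. It assumes the base class carries the plain C++ code rather than an empty body, and that an optimized class overrides only the methods it actually has. The class name AlgoImplSSE and the string return values are made up for illustration, so the chosen path is visible without real SIMD code.

```cpp
#include <string>

// Base class: holds the plain C++ version of every method (assumption).
class AlgoImpl {
public:
    virtual ~AlgoImpl() {}
    virtual std::string computeSomeImpl() { return "scalar computeSome"; }
    virtual std::string computeMoreImpl() { return "scalar computeMore"; }
};

// Hypothetical optimized class: only computeSomeImpl has an SSE version so far.
class AlgoImplSSE : public AlgoImpl {
public:
    virtual std::string computeSomeImpl() { return "sse computeSome"; }
    // computeMoreImpl is not overridden, so calls resolve to the base version.
};
```

Calling computeMoreImpl() through an AlgoImplSSE object runs the scalar code, at the price of a vtable pointer per object and an indirect call per method.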

4. A form of tag dispatching via enable_if<>

Pros:

- the split is quite straightforward and easy to do
- much cleaner than #ifdef-ing all over the place
- there is still no “overhead” to my class
- will eliminate the need for different files for different implementations

Cons:

- templates will be a bit more “cryptic” and seem to bring an unnecessary overhead (at least for some people in some contexts)
- if I do not yet have the specific optimized implementation for, let’s say, SSE, I would have to duplicate some code from the plain (Scalar) C++ implementation
- I cannot fall back to the plain C++ implementation if needed
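Since variant 4 has no code above, here is a rough sketch of what it might look like. The tag types and the active_tag typedef are invented for illustration, the ALTIVEC and SSE macros are the ones from variant 1, and std::enable_if assumes C++11 (boost::enable_if would play the same role on older compilers). The methods return a label instead of computing anything.

```cpp
#include <string>
#include <type_traits>

// Hypothetical tag types, one per implementation.
struct scalar_tag {};
struct sse_tag {};
struct altivec_tag {};

// Pick the active tag once, using the same macros as in variant 1.
#if defined(ALTIVEC)
typedef altivec_tag active_tag;
#elif defined(SSE)
typedef sse_tag active_tag;
#else
typedef scalar_tag active_tag;
#endif

class Algo {
public:
    // Forwards to whichever overload is enabled for active_tag.
    std::string computeSome() { return computeSomeImpl<active_tag>(); }

private:
    template <typename Tag>
    typename std::enable_if<std::is_same<Tag, scalar_tag>::value, std::string>::type
    computeSomeImpl() { return "scalar"; }

    template <typename Tag>
    typename std::enable_if<std::is_same<Tag, sse_tag>::value, std::string>::type
    computeSomeImpl() { return "sse"; }

    template <typename Tag>
    typename std::enable_if<std::is_same<Tag, altivec_tag>::value, std::string>::type
    computeSomeImpl() { return "altivec"; }
};
```

Only the overload matching active_tag is instantiated, so objects of the class carry no extra data, but every implementation has to live in the same header.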

What I couldn’t figure out yet for any of the variants is how to properly and
cleanly fall back to the plain C++ implementation.

Also, I don’t want to over-engineer things, and in that respect the first variant
seems the most “KISS”-like, even considering the disadvantages.


1 Answer

Editorial Team · Answer 1 · 2026-05-26T03:20:18+00:00

As requested in the comments, here’s a summary of what I did:

Set up `policy_list` helper template utility

This maintains a list of policies and gives each one a “runtime check” call, then calls the first suitable implementation.

#include <cassert>

template <typename P, typename N=void>
struct policy_list {
  static void apply() {
    if (P::runtime_check()) {
      P::impl();
    }
    else {
      N::apply();
    }
  }
};

template <typename P>
struct policy_list<P,void> {
  static void apply() {
    assert(P::runtime_check());
    P::impl();
  }
};

Set up specific policies

These policies implement both a runtime test and an actual implementation of the algorithm in question. For my actual problem, impl took another template parameter that specified what exactly was being implemented; here, though, the example assumes there is only one thing to be implemented. The runtime tests are cached in a static bool because for some (e.g. the Altivec one I used) the test was really slow. For others (e.g. the OpenCL one) the test is actually “is this function pointer NULL?” after one attempt at setting it with dlsym().

#include <iostream>

// runtime SSE detection (That's another question!)
extern bool have_sse();

struct sse_policy {
  static void impl() {
    std::cout << "SSE" << std::endl;
  }

  static bool runtime_check() {
    static bool result = have_sse();
    // have_sse lives in another TU and does some cpuid asm stuff
    return result;
  }
};

// Runtime OpenCL detection
extern bool have_opencl();

struct opencl_policy {
  static void impl() {
    std::cout << "OpenCL" << std::endl;
  }

  static bool runtime_check() {
    static bool result = have_opencl();
    // have_opencl lives in another TU and does some LoadLibrary or dlopen()
    return result;
  }
};

struct basic_policy {
  static void impl() {
    std::cout << "Standard C++ policy" << std::endl;
  }

  static bool runtime_check() { return true; } // All implementations do this
};
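The policies above leave have_sse() to another translation unit. As a rough idea of what it could contain, here is a sketch that assumes GCC or Clang on x86 and uses their __builtin_cpu_supports builtin instead of hand-written CPUID assembly. This is not the implementation the answer used; other compilers would need their own path.

```cpp
// Hypothetical have_sse() for GCC/Clang on x86 targets.
bool have_sse() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
  // The builtin queries CPUID data gathered by the compiler's runtime.
  return __builtin_cpu_supports("sse") != 0;
#else
  // Unknown compiler or non-x86 target: report "not available".
  return false;
#endif
}
```

Because sse_policy::runtime_check() caches the result in a static bool, this function would only run once per process.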

Set per architecture `policy_list`

This trivial example sets one of two possible lists based on the ARCH_HAS_SSE preprocessor macro. You might generate this from your build script, or use a series of typedefs, or hack in support for “holes” in the policy_list: entries that are void on some architectures, so the list skips straight to the next one without trying to check for support. GCC sets some preprocessor macros for you that might help, e.g. __SSE2__.

#ifdef ARCH_HAS_SSE
typedef policy_list<opencl_policy,
        policy_list<sse_policy,
        policy_list<basic_policy
                    > > > active_policy;
#else
typedef policy_list<opencl_policy,
        policy_list<basic_policy
                    > > active_policy;
#endif

You can use this to compile multiple variants on the same platform too, e.g. an SSE and a no-SSE binary on x86.

Use the policy list

Fairly straightforward, call the apply() static method on the policy_list. Trust that it will call the impl() method on the first policy that passes the runtime test.

int main() {
  active_policy::apply();
}

If you take the “per operation template” approach I mentioned earlier it might be something more like:

int main() {
  Matrix m1, m2;
  Vector v1;

  active_policy::apply<matrix_mult_t>(m1, m2);
  active_policy::apply<vector_mult_t>(m1, v1);
}
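For the per-operation form, policy_list::apply needs an operation parameter and has to forward its arguments. One way that might look is sketched below, under several assumptions: operations are empty tag structs with a result_type typedef, each policy overloads impl on the tag, and Vec3, dot_t, sum_t and unavailable_policy are invented for illustration in place of the Matrix and Vector types above.

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };

// Hypothetical operation tags, each naming its result type.
struct dot_t { typedef double result_type; };
struct sum_t { typedef Vec3 result_type; };

struct basic_policy {
  static bool runtime_check() { return true; }
  static double impl(dot_t, const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
  }
  static Vec3 impl(sum_t, const Vec3& a, const Vec3& b) {
    Vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
  }
};

// Stands in for an accelerated policy on a machine that lacks the hardware.
struct unavailable_policy {
  static bool runtime_check() { return false; }
  static double impl(dot_t, const Vec3&, const Vec3&) { return -1.0; }
  static Vec3 impl(sum_t, const Vec3& a, const Vec3&) { return a; }
};

template <typename P, typename N = void>
struct policy_list {
  template <typename Op, typename A, typename B>
  static typename Op::result_type apply(const A& a, const B& b) {
    if (P::runtime_check()) return P::impl(Op(), a, b);
    return N::template apply<Op>(a, b);  // try the next policy
  }
};

template <typename P>
struct policy_list<P, void> {
  template <typename Op, typename A, typename B>
  static typename Op::result_type apply(const A& a, const B& b) {
    assert(P::runtime_check());
    return P::impl(Op(), a, b);
  }
};

typedef policy_list<unavailable_policy,
        policy_list<basic_policy> > active_policy;
```

A call such as active_policy::apply<dot_t>(a, b) then falls through unavailable_policy to the plain C++ overload in basic_policy.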

In that case you end up making your Matrix and Vector types aware of the policy_list so that they can decide how and where to store the data. You can also use heuristics for this, e.g. “small vector/matrix lives in main memory no matter what”, and make runtime_check() or another function test whether a particular implementation is appropriate for a specific instance.

I also had a custom allocator for containers, which always produced suitably aligned memory on any SSE/Altivec-enabled build, regardless of whether the specific machine had support for Altivec. It was just easier that way, although it could be a typedef in a given policy, assuming that the highest-priority policy has the strictest allocator needs.

Example `have_altivec()`:

I’ve included a sample have_altivec() implementation for completeness, simply because it’s the shortest and therefore most appropriate for posting here. The x86/x86_64 CPUID one is messy because you have to support the compiler specific ways of writing inline ASM. The OpenCL one is messy because we check some of the implementation limits and extensions too.

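#include <csetjmp>
#include <csignal>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Same macro as the one tested inside have_altivec() below
#if HAVE_SETJMP_H && !(defined(__APPLE__) && defined(__MACH__))
jmp_buf jmpbuf;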

void illegal_instruction(int sig) {
   // Bad in general - https://www.securecoding.cert.org/confluence/display/seccode/SIG32-C.+Do+not+call+longjmp%28%29+from+inside+a+signal+handler
   // But actually Ok on this platform in this scenario
   longjmp(jmpbuf, 1);
}
#endif

bool have_altivec()
{
    volatile sig_atomic_t altivec = 0;
#ifdef __APPLE__
    int selectors[2] = { CTL_HW, HW_VECTORUNIT };
    int hasVectorUnit = 0;
    size_t length = sizeof(hasVectorUnit);
    int error = sysctl(selectors, 2, &hasVectorUnit, &length, NULL, 0);
    if (0 == error)
        altivec = (hasVectorUnit != 0);
#elif HAVE_SETJMP_H
    void (*handler) (int sig);
    handler = signal(SIGILL, illegal_instruction);
    if (setjmp(jmpbuf) == 0) {
        asm volatile ("mtspr 256, %0\n\t" "vand %%v0, %%v0, %%v0"::"r" (-1));
        altivec = 1;
    }
    signal(SIGILL, handler);
#endif

    return altivec;
}

Conclusion

Basically you pay no penalty for platforms that can never support an implementation (the compiler generates no code for them) and only a small penalty (potentially just a test/jmp pair that the CPU predicts very well, if your compiler is half-decent at optimising) for platforms that could support something but don’t. You pay no extra cost on platforms where the first-choice implementation runs. The details of the runtime tests vary with the technology in question.
