I’ve been trying to create template kernels but I’ve been having some trouble calling

Question

Asked: May 23, 2026 (2026-05-23T05:03:30+00:00)

I’ve been trying to create template kernels, but I’ve been having some trouble calling them in my program. I have a Matrix<T> template class with some methods defined inside it.

Matrix.h:

template <typename T> class Matrix {
    ...
    void sum(Matrix<T>& m1, Matrix<T>& m2, Matrix<T>& sum);
    ...
};

#include "Matrix.cu"

Matrix.cu:

#include "MatrixKernel.h"

template<typename T> void Matrix<T>::sum(Matrix<T>& m1, Matrix<T>& m2, Matrix<T>& sum) {
    ...
    sumKernel<T><<<dimGrid, dimBlock>>>(m1, m2, sum);
    ...
}

MatrixKernel.h:

template<typename T> __global__ void sumKernel(const Matrix<T> m1, const Matrix<T> m2, Matrix<T> sum) {
...
}

The problem is that when I call sumKernel from inside of sum, the compiler gives me the following error:

error C2059: syntax error : '<'

Does somebody know what’s going on? The code compiled fine just before I included the sumKernel call.

Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T05:03:31+00:00

It looks like a misplaced #include is causing CUDA code to be compiled by the wrong compiler. Distinguish GPU headers from CPU headers, for example by using the .cu.h extension for CUDA headers, and make sure only NVCC compiles .cu and .cu.h files. The kernel and the kernel launch should live in a .cu or .cu.h file, and those files should never be included, directly or indirectly, from .cpp files.

Because your .cu file is included from a header that the host compiler compiles, the host compiler ends up hitting the token <<<, which it doesn’t recognise. It does understand the token <<, so it probably consumes that and is left with an unexpected <.
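To make the separation concrete, here is a minimal single-file sketch, not the asker's actual code. It assumes flat float arrays instead of Matrix<T>, and the names sumArrays, sum_api.h and sum_impl.cu are hypothetical. Host code only ever sees the plain C++ declaration; the kernel and the <<< >>> launch sit behind __CUDACC__, the macro nvcc defines while compiling CUDA source. The host fallback is there only so the sketch also builds with an ordinary C++ compiler.

```cpp
// Single-file sketch. In a real project the two parts would be separate files.
#include <cstddef>

// ---- sum_api.h (hypothetical name): plain C++, safe to include from any .cpp ----
void sumArrays(const float* a, const float* b, float* out, std::size_t n);

// ---- sum_impl.cu (hypothetical name): kernel and launch, seen only by nvcc ----
#ifdef __CUDACC__  // defined by nvcc while compiling CUDA source
#include <cuda_runtime.h>

__global__ void sumKernel(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

void sumArrays(const float* a, const float* b, float* out, std::size_t n) {
    if (n == 0) return;  // a launch with zero blocks is invalid
    const std::size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    // Error checking is omitted to keep the sketch short.
    cudaMalloc(reinterpret_cast<void**>(&dA), bytes);
    cudaMalloc(reinterpret_cast<void**>(&dB), bytes);
    cudaMalloc(reinterpret_cast<void**>(&dOut), bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);
    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);
    sumKernel<<<blocks, threads>>>(dA, dB, dOut, n);
    // A blocking copy on the default stream; it runs after the kernel has finished.
    cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
}
#else
// Host fallback, included only so the sketch builds with a plain C++ compiler.
void sumArrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
#endif
```

A .cpp file would include only the declaration and call sumArrays(a, b, out, n), so the host compiler never meets the <<< token.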

Here’s an alternative way of doing things that should work. I haven’t tried it, but it’s similar to code we use.

(Note: this might work, but it might not be the right way to solve the problem. My boss doesn’t like it as a solution and would prefer to add one implementation per variation.)

The underlying problem seems to be lack of distinction between host and device code. I’m leaving the detail out in my solution – things like copying results to and from the device, sum implementation, etc.

The problem I’m trying to solve is, given a construct, how can you template it for use both on the host and the device?

I’ll template Matrix.h on both the type and the implementation detail.

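 // Implementation is a template template parameter, e.g. HostMatrixSum or GpuMatrixSum.
 template <typename T, template <typename> class Implementation> class Matrix {
 public:
     // Inside the class, Matrix names this specialisation, Matrix<T, Implementation>.
     void sum(Matrix& m1, Matrix& m2, Matrix& sum)
     {
         Implementation<T>().sumImp(m1, m2, sum);
     }
 };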

The host implementation, HostMatrixSum.h, will do things on the CPU:

 #include "Matrix.h"

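 template <typename T> struct HostMatrixSum
 {
     // Matrix takes two template arguments: the element type and this implementation.
     void sumImp(Matrix<T, HostMatrixSum>& m1, Matrix<T, HostMatrixSum>& m2, Matrix<T, HostMatrixSum>& sum)
     {
         ...
     }
 };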

While GpuMatrixSum.cu.h will upload the matrix, do the sum and recover the results:

 #include "Matrix.h"

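 template <typename T> struct GpuMatrixSum;

 // A __global__ function cannot be a class member, so the kernel sits at namespace scope.
 template<typename T> __global__ void sumKernel(const Matrix<T, GpuMatrixSum> m1, const Matrix<T, GpuMatrixSum> m2, Matrix<T, GpuMatrixSum> sum)
 {
     ...
 }

 template <typename T> struct GpuMatrixSum
 {
     void sumImp(Matrix<T, GpuMatrixSum>& m1, Matrix<T, GpuMatrixSum>& m2, Matrix<T, GpuMatrixSum>& sum)
     {
         ...
         sumKernel<T> <<< dimGrid, dimBlock >>> (m1, m2, sum);
         ...
     }
 };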

Then when we come to use Matrix from host code we template on the host sum implementation and never need to see any cuda specifics:

 #include "Matrix.h"
 #include "HostMatrixSum.h"

 Matrix<int, HostMatrixSum> m1(...);
 Matrix<int, HostMatrixSum> m2(...);
 Matrix<int, HostMatrixSum> result;
 result.sum(m1, m2, result);
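For reference, here is a self-contained host-only sketch of this pattern that builds with any C++11 compiler. The storage layout (a flat std::vector), the constructor, size() and operator[] are assumptions made for illustration; the original leaves those details out. It also assumes all three matrices have the same dimensions.

```cpp
#include <cstddef>
#include <vector>

// Matrix is parameterised on the element type and on a sum implementation.
template <typename T, template <typename> class Implementation>
class Matrix {
public:
    // Hypothetical constructor: rows x cols elements, all set to fill.
    Matrix(std::size_t rows, std::size_t cols, T fill = T())
        : rows_(rows), cols_(cols), data_(rows * cols, fill) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    std::size_t size() const { return data_.size(); }
    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }

    // Forward the work to whichever implementation was chosen at compile time.
    void sum(const Matrix& m1, const Matrix& m2, Matrix& out) {
        Implementation<T>().sumImp(m1, m2, out);
    }

private:
    std::size_t rows_, cols_;
    std::vector<T> data_;
};

// CPU implementation: a plain element-wise loop, no CUDA anywhere.
template <typename T>
struct HostMatrixSum {
    typedef Matrix<T, HostMatrixSum> M;

    // Assumes m1, m2 and out all have the same number of elements.
    void sumImp(const M& m1, const M& m2, M& out) const {
        for (std::size_t i = 0; i < out.size(); ++i) {
            out[i] = m1[i] + m2[i];
        }
    }
};
```

With these definitions, Matrix<int, HostMatrixSum> m1(2, 2, 3), m2(2, 2, 4), result(2, 2); followed by result.sum(m1, m2, result); fills every element of result with 7. A GPU implementation would plug in the same way, but its header would have to be included only from files compiled by nvcc.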

And if we’re working on the gpu we can use the accelerated gpu implementation of sum:

 #include "Matrix.h"
 #include "GpuMatrixSum.cu.h"

 Matrix<int, GpuMatrixSum> m1(...);
 Matrix<int, GpuMatrixSum> m2(...);
 Matrix<int, GpuMatrixSum> result;
 result.sum(m1, m2, result);

Hope that works for you!
