I’ve been trying to create template kernels but I’ve been having some trouble calling them in my program. I have a Matrix<T> template class, and some methods defined inside it:
Matrix.h:
template <typename T> class Matrix {
...
void sum(Matrix<T>& m1, Matrix<T>& m2, Matrix<T>& sum);
...
};
#include "Matrix.cu"
Matrix.cu:
#include "MatrixKernel.h"
template<typename T> void Matrix<T>::sum(Matrix<T>& m1, Matrix<T>& m2, Matrix<T>& sum) {
...
sumKernel<T><<<dimGrid, dimBlock>>>(m1, m2, sum);
...
}
MatrixKernel.h:
template<typename T> __global__ void sumKernel(const Matrix<T> m1, const Matrix<T> m2, Matrix<T> sum) {
...
}
The problem is that when I call sumKernel from inside of sum, the compiler gives me the following error:
error C2059: syntax error : '<'
Does somebody know what’s going on? The code compiled fine just before I included the sumKernel call.
Thanks.
So, it seems you do have a strange #include, leading to code getting compiled by the wrong compiler. Make a distinction between GPU headers and CPU headers by using .cu.h for CUDA headers. Make sure only NVCC compiles .cu and .cu.h files. CUDA files should never be included in cpp files. The kernel and kernel call should be in .cu or .cu.h files, and those files shouldn’t be included anywhere in cpps.
Because your .cu is being included in a header which is being compiled by the host compiler, the host compiler ends up hitting the token <<<, which it doesn’t recognise. It probably does understand the token << so it consumes that, leaving an unexpected <.
Here’s an alternative way of doing things that should work (not tried it but it’s similar to code we use)
(note, this might work but it also might not be the right way to solve the problem. My boss doesn’t like it as a solution and would prefer to add an implementation per variation)
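For comparison, the "implementation per variation" route mentioned in the note might look roughly like the sketch below. It is only an illustration under some assumptions: the function name sumOnDevice and the use of std::vector are made up for the example, the two "files" are shown in one listing, and the body is a CPU stand-in so it builds without a GPU. In a real project the second half would live in a .cu file compiled only by NVCC and would launch the kernel there.

```cpp
// ---- MatrixSum.h (hypothetical): host-visible, safe to include from .cpp ----
#include <cassert>
#include <cstddef>
#include <vector>

// Declaration only: no __global__ and no <<<...>>>, so the host compiler
// never sees any CUDA-specific syntax.
template <typename T>
void sumOnDevice(const std::vector<T>& a, const std::vector<T>& b,
                 std::vector<T>& out);

// ---- MatrixSum.cu (hypothetical): in a real build, compiled by nvcc only ----
template <typename T>
void sumOnDevice(const std::vector<T>& a, const std::vector<T>& b,
                 std::vector<T>& out) {
    assert(a.size() == b.size());
    out.resize(a.size());
    // CPU stand-in. In the real .cu file this is where the data would be
    // copied to the device and sumKernel<T><<<dimGrid, dimBlock>>>(...)
    // would be launched.
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
}

// One explicit instantiation per supported element type. When the definition
// really lives in a separate .cu file, these are the only versions that
// host code can link against.
template void sumOnDevice<float>(const std::vector<float>&,
                                 const std::vector<float>&,
                                 std::vector<float>&);
template void sumOnDevice<int>(const std::vector<int>&,
                               const std::vector<int>&,
                               std::vector<int>&);
```

The trade-off is that every new element type needs another instantiation line, but the kernel launch stays entirely out of anything the host compiler parses.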
The underlying problem seems to be lack of distinction between host and device code. I’m leaving the detail out in my solution – things like copying results to and from the device, sum implementation, etc.
The problem I’m trying to solve is, given a construct, how can you template it for use both on the host and the device?
I’ll template Matrix.h on both the type and the implementation detail.
The host implementation, HostMatrixSum.h, will do things on the CPU, while GpuMatrixSum.cu.h will upload the matrix, do the sum and recover the results.
Then when we come to use Matrix from host code we template on the host sum implementation and never need to see any CUDA specifics.
And if we’re working on the GPU we can use the accelerated GPU implementation of sum.
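As a rough host-only sketch of the idea (not the exact code we use): Matrix takes the sum implementation as a second template parameter and just forwards to it. The member names, the raw-pointer interface of sum, and StagedMatrixSum are all assumptions for illustration. StagedMatrixSum is a CPU stand-in that only mimics the upload, compute, download shape that GpuMatrixSum.cu.h would have, so the example runs without CUDA.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Matrix.h: templated on the element type T and on a policy providing sum().
template <typename T, typename SumImpl>
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols, T init = T())
        : rows_(rows), cols_(cols), data_(rows * cols, init) {}

    T& at(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    const T& at(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    // Element-wise sum; the actual work is delegated to the policy.
    Matrix sum(const Matrix& other) const {
        assert(rows_ == other.rows_ && cols_ == other.cols_);
        Matrix result(rows_, cols_);
        SumImpl::sum(data_.data(), other.data_.data(),
                     result.data_.data(), data_.size());
        return result;
    }

private:
    std::size_t rows_, cols_;
    std::vector<T> data_;
};

// HostMatrixSum.h: plain CPU loop, no CUDA anywhere.
struct HostMatrixSum {
    template <typename T>
    static void sum(const T* a, const T* b, T* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    }
};

// Stand-in for GpuMatrixSum.cu.h (hypothetical). The real version would be
// compiled by nvcc and use device allocations, copies and a kernel launch.
struct StagedMatrixSum {
    template <typename T>
    static void sum(const T* a, const T* b, T* out, std::size_t n) {
        std::vector<T> devA(a, a + n), devB(b, b + n), devOut(n);  // "upload"
        for (std::size_t i = 0; i < n; ++i) {
            devOut[i] = devA[i] + devB[i];  // where the kernel would run
        }
        std::copy(devOut.begin(), devOut.end(), out);  // "download"
    }
};
```

Host code would then use Matrix<float, HostMatrixSum> and never include anything CUDA-related, while code built with NVCC could pick the GPU policy instead. The cost is that the implementation choice becomes part of the type, so a Matrix<float, HostMatrixSum> and a Matrix<float, StagedMatrixSum> are unrelated types.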
Hope that works for you!