Nvidia Performance Primitives (NPP) provides the nppiFilter function for convolving a user-provided image with

Question

0

Asked: June 12, 20262026-06-12T23:36:58+00:00 2026-06-12T23:36:58+00:00

Nvidia Performance Primitives (NPP) provides the nppiFilter function for convolving a user-provided image with

0

Nvidia Performance Primitives (NPP) provides the nppiFilter function for convolving a user-provided image with a user-provided kernel. For 1D convolution kernels, nppiFilter works properly. However, nppiFilter is producing a garbage image for 2D kernels.

I used the typical Lena image as input:
enter image description here

Here’s my experiment with a 1D convolution kernel, which produces good output.

#include <npp.h> // provided in CUDA SDK
#include <ImagesCPU.h> // these image libraries are also in CUDA SDK
#include <ImagesNPP.h>
#include <ImageIO.h>

void test_nppiFilter()
{
    npp::ImageCPU_8u_C1 oHostSrc;
    npp::loadImage("Lena.pgm", oHostSrc);
    npp::ImageNPP_8u_C1 oDeviceSrc(oHostSrc); // malloc and memcpy to GPU 
    NppiSize kernelSize = {3, 1}; // dimensions of convolution kernel (filter)
    NppiSize oSizeROI = {oHostSrc.width() - kernelSize.width + 1, oHostSrc.height() - kernelSize.height + 1};
    npp::ImageNPP_8u_C1 oDeviceDst(oSizeROI.width, oSizeROI.height); // allocate device image of appropriately reduced size
    npp::ImageCPU_8u_C1 oHostDst(oDeviceDst.size());
    NppiPoint oAnchor = {2, 1}; // found that oAnchor = {2,1} or {3,1} works for kernel [-1 0 1] 
    NppStatus eStatusNPP;

    Npp32s hostKernel[3] = {-1, 0, 1}; // convolving with this should do edge detection
    Npp32s* deviceKernel;
    size_t deviceKernelPitch;
    cudaMallocPitch((void**)&deviceKernel, &deviceKernelPitch, kernelSize.width*sizeof(Npp32s), kernelSize.height*sizeof(Npp32s));
    cudaMemcpy2D(deviceKernel, deviceKernelPitch, hostKernel,
                     sizeof(Npp32s)*kernelSize.width, // sPitch
                     sizeof(Npp32s)*kernelSize.width, // width
                     kernelSize.height, // height
                     cudaMemcpyHostToDevice);
    Npp32s divisor = 1; // no scaling

    eStatusNPP = nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                                          oDeviceDst.data(), oDeviceDst.pitch(),
                                          oSizeROI, deviceKernel, kernelSize, oAnchor, divisor);

    cout << "NppiFilter error status " << eStatusNPP << endl; // prints 0 (no errors)
    oDeviceDst.copyTo(oHostDst.data(), oHostDst.pitch()); // memcpy to host
    saveImage("Lena_filter_1d.pgm", oHostDst); 
}

Output of the above code with kernel [-1 0 1] — it looks like a reasonable gradient image:
enter image description here

However, nppiFilter outputs a garbage image if I use a 2D convolution kernel. Here are the things that I changed from the above code to run with the 2D kernel [-1 0 1; -1 0 1; -1 0 1]:

NppiSize kernelSize = {3, 3};
Npp32s hostKernel[9] = {-1, 0, 1, -1, 0, 1, -1, 0, 1};
NppiPoint oAnchor = {2, 2}; // note: using anchor {1,1} or {0,0} causes error -24 (NPP_TEXTURE_BIND_ERROR)
saveImage("Lena_filter_2d.pgm", oHostDst);

Below is the output image using the 2D kernel [-1 0 1; -1 0 1; -1 0 1].

What am I doing wrong?

enter image description here

This StackOverflow post describes a similar problem, as shown in user Steenstrup’s image: http://1ordrup.dk/kasper/image/Lena_boxFilter5.jpg

A few final notes:

With the 2D kernel, for certain anchor values (e.g. NppiPoint oAnchor = {0, 0} or {1, 1}), I get error -24, which translates to NPP_TEXTURE_BIND_ERROR according to the NPP User Guide. This issue was mentioned briefly in this StackOverflow post.
This code is very verbose. This isn’t the main question, but does anyone have any suggestions for how to make this code more concise?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-12T23:37:00+00:00

You are using a 2D memory allocator for the kernel array. Kernel arrays are dense 1D arrays, not 2D strided arrays as the typical NPP image is.

Simply replace the 2D CUDA malloc with a simple cuda malloc of size kernelWidth*kernelHeight*sizeof(Npp32s) and do a normal CUDA memcopy not memcopy 2D.

//1D instead of 2D
cudaMalloc((void**)&deviceKernel, kernelSize.width * kernelSize.height * sizeof(Npp32s));
cudaMemcpy(deviceKernel, hostKernel, kernelSize.width * kernelSize.height * sizeof(Npp32s), cudaMemcpyHostToDevice);

As an aside, a “scale factor” of 1 does not translate to no scaling. Scaling happens with factors 2^(-ScaleFactor).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Nvidia Performance Primitives (NPP) provides the nppiFilter function for convolving a user-provided image with

What am I doing wrong?

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply