Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8689317
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 12, 20262026-06-12T23:36:58+00:00 2026-06-12T23:36:58+00:00

Nvidia Performance Primitives (NPP) provides the nppiFilter function for convolving a user-provided image with

  • 0

Nvidia Performance Primitives (NPP) provides the nppiFilter function for convolving a user-provided image with a user-provided kernel. For 1D convolution kernels, nppiFilter works properly. However, nppiFilter is producing a garbage image for 2D kernels.

I used the typical Lena image as input:
enter image description here


Here’s my experiment with a 1D convolution kernel, which produces good output.

#include <npp.h> // provided in CUDA SDK
#include <ImagesCPU.h> // these image libraries are also in CUDA SDK
#include <ImagesNPP.h>
#include <ImageIO.h>

void test_nppiFilter()
{
    npp::ImageCPU_8u_C1 oHostSrc;
    npp::loadImage("Lena.pgm", oHostSrc);
    npp::ImageNPP_8u_C1 oDeviceSrc(oHostSrc); // malloc and memcpy to GPU 
    NppiSize kernelSize = {3, 1}; // dimensions of convolution kernel (filter)
    NppiSize oSizeROI = {oHostSrc.width() - kernelSize.width + 1, oHostSrc.height() - kernelSize.height + 1};
    npp::ImageNPP_8u_C1 oDeviceDst(oSizeROI.width, oSizeROI.height); // allocate device image of appropriately reduced size
    npp::ImageCPU_8u_C1 oHostDst(oDeviceDst.size());
    NppiPoint oAnchor = {2, 1}; // found that oAnchor = {2,1} or {3,1} works for kernel [-1 0 1] 
    NppStatus eStatusNPP;

    Npp32s hostKernel[3] = {-1, 0, 1}; // convolving with this should do edge detection
    Npp32s* deviceKernel;
    size_t deviceKernelPitch;
    cudaMallocPitch((void**)&deviceKernel, &deviceKernelPitch, kernelSize.width*sizeof(Npp32s), kernelSize.height*sizeof(Npp32s));
    cudaMemcpy2D(deviceKernel, deviceKernelPitch, hostKernel,
                     sizeof(Npp32s)*kernelSize.width, // sPitch
                     sizeof(Npp32s)*kernelSize.width, // width
                     kernelSize.height, // height
                     cudaMemcpyHostToDevice);
    Npp32s divisor = 1; // no scaling

    eStatusNPP = nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                                          oDeviceDst.data(), oDeviceDst.pitch(),
                                          oSizeROI, deviceKernel, kernelSize, oAnchor, divisor);

    cout << "NppiFilter error status " << eStatusNPP << endl; // prints 0 (no errors)
    oDeviceDst.copyTo(oHostDst.data(), oHostDst.pitch()); // memcpy to host
    saveImage("Lena_filter_1d.pgm", oHostDst); 
}

Output of the above code with kernel [-1 0 1] — it looks like a reasonable gradient image:
enter image description here


However, nppiFilter outputs a garbage image if I use a 2D convolution kernel. Here are the things that I changed from the above code to run with the 2D kernel [-1 0 1; -1 0 1; -1 0 1]:

NppiSize kernelSize = {3, 3};
Npp32s hostKernel[9] = {-1, 0, 1, -1, 0, 1, -1, 0, 1};
NppiPoint oAnchor = {2, 2}; // note: using anchor {1,1} or {0,0} causes error -24 (NPP_TEXTURE_BIND_ERROR)
saveImage("Lena_filter_2d.pgm", oHostDst);

Below is the output image using the 2D kernel [-1 0 1; -1 0 1; -1 0 1].

What am I doing wrong?

enter image description here

This StackOverflow post describes a similar problem, as shown in user Steenstrup’s image: http://1ordrup.dk/kasper/image/Lena_boxFilter5.jpg


A few final notes:

  • With the 2D kernel, for certain anchor values (e.g. NppiPoint oAnchor = {0, 0} or {1, 1}), I get error -24, which translates to NPP_TEXTURE_BIND_ERROR according to the NPP User Guide. This issue was mentioned briefly in this StackOverflow post.
  • This code is very verbose. This isn’t the main question, but does anyone have any suggestions for how to make this code more concise?
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-12T23:37:00+00:00Added an answer on June 12, 2026 at 11:37 pm

    You are using a 2D memory allocator for the kernel array. Kernel arrays are dense 1D arrays, not 2D strided arrays as the typical NPP image is.

    Simply replace the 2D CUDA malloc with a simple cuda malloc of size kernelWidth*kernelHeight*sizeof(Npp32s) and do a normal CUDA memcopy not memcopy 2D.

    //1D instead of 2D
    cudaMalloc((void**)&deviceKernel, kernelSize.width * kernelSize.height * sizeof(Npp32s));
    cudaMemcpy(deviceKernel, hostKernel, kernelSize.width * kernelSize.height * sizeof(Npp32s), cudaMemcpyHostToDevice);
    

    As an aside, a “scale factor” of 1 does not translate to no scaling. Scaling happens with factors 2^(-ScaleFactor).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

In the Nvidia Performance Primitives (NPP) image processing examples in the CUDA SDK distribution
According to NVIDIA, this is the fastest sum reduction kernel: template <unsigned int blockSize>
I developed a naive function for mirroring an image horizontally or vertically using CUDA
When looking at the name of the performance counters in NVIDIA Fermi architecture (the
I'm trying to compare GPU to CPU performance. For the NVIDIA GPU I've been
I've written a simple serial 1D convolution function (below). I'm also experimenting with GPU
What is the maximum number of concurrent kernels possible for NVIDIA devices of compute
I have problem running samples provided by Nvidia in their GPU Computing SDK (there's
In the NVIDIA README for the Quadro card X driver, there is this comment:
I want to use nVIDIA compiler to generate a shared library for my GNU

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.