The Archive Base

Editorial Team
Asked: June 11, 2026

Writing some signal processing in CUDA I recently made huge progress in optimizing it.

While writing some signal processing code in CUDA, I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns, I managed to get a 10× performance boost. (I previously tried transaction-aligned prefetching from global into shared memory, but I think the nonuniform access patterns that happen later messed up the association between warps and shared memory banks.)

Now I'm facing the question of how CUDA texture bindings interact with asynchronous memcpy.

Consider the following kernel:

texture<...> mytexture;

__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}

The kernel is launched in multiple streams:

extern void *sourcedata;

#define N_CUDA_STREAMS ...

cudaStream_t stream[N_CUDA_STREAMS];
float *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];

for(int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(&stream[k_stream]);

    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}

/* ... */

for(int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);

    /* rebinds the single texture reference; there is no stream argument */
    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);

    mykernel<<<..., stream[k_stream]>>>(d_pOut[k_stream]);
}

What I wonder is this: since there is only one texture reference, what happens when I bind a buffer to the texture while kernels in other streams are still accessing it? cudaBindTexture doesn't take a stream parameter, so I'm worried that rebinding the texture to another device pointer, while running kernels are asynchronously reading from it, will divert their accesses to the other data.

The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed to the kernel as a parameter.

Unfortunately, CUDA doesn't allow arrays of texture references on the device side, i.e. the following does not work:

texture<...> texarray[N_CUDA_STREAMS];

Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture bound to linear memory rather than to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
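The switch-based workaround described above could look roughly like the following. This is only a minimal sketch under several assumptions: it uses the legacy texture reference API (so it needs a toolkit that still ships it; that API was removed in CUDA 12), it hard-codes two streams, and the names `tex0`, `tex1`, `fetchByStream`, `runSwitchDemo` and the buffer size are made up for illustration. Each texture reference is bound once to its own stream's buffer, so nothing is rebound while kernels are running.

```cuda
#include <cuda_runtime.h>

// Legacy texture references, one per stream (assumption: two streams).
texture<float, 1, cudaReadModeElementType> tex0;
texture<float, 1, cudaReadModeElementType> tex1;

// Hypothetical helper: pick the texture reference from a runtime index.
__device__ float fetchByStream(int k_stream, int x)
{
    switch (k_stream) {
    case 0:  return tex1Dfetch(tex0, x);
    default: return tex1Dfetch(tex1, x);
    }
}

__global__ void mykernel(float *pOut, int k_stream, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pOut[i] = fetchByStream(k_stream, i);
}

// Hypothetical driver: returns the number of mismatching elements (0 = OK).
int runSwitchDemo()
{
    int const n = 256;
    size_t const bytes = n * sizeof(float);
    cudaStream_t stream[2];
    float *h_in[2], *h_out[2], *d_tex[2], *d_out[2];

    for (int k = 0; k < 2; k++) {
        cudaStreamCreate(&stream[k]);
        cudaMallocHost(&h_in[k], bytes);   // pinned host memory for async copies
        cudaMallocHost(&h_out[k], bytes);
        cudaMalloc(&d_tex[k], bytes);
        cudaMalloc(&d_out[k], bytes);
        for (int i = 0; i < n; i++)
            h_in[k][i] = 1000.0f * k + i;  // distinct data per stream
    }

    // Bind once, before any kernel is launched.
    cudaBindTexture(0, tex0, d_tex[0], bytes);
    cudaBindTexture(0, tex1, d_tex[1], bytes);

    for (int k = 0; k < 2; k++) {
        cudaMemcpyAsync(d_tex[k], h_in[k], bytes, cudaMemcpyHostToDevice, stream[k]);
        mykernel<<<(n + 127) / 128, 128, 0, stream[k]>>>(d_out[k], k, n);
        cudaMemcpyAsync(h_out[k], d_out[k], bytes, cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();

    int errors = 0;
    for (int k = 0; k < 2; k++)
        for (int i = 0; i < n; i++)
            if (h_out[k][i] != h_in[k][i]) errors++;

    cudaUnbindTexture(tex0);
    cudaUnbindTexture(tex1);
    for (int k = 0; k < 2; k++) {
        cudaFree(d_tex[k]);     cudaFree(d_out[k]);
        cudaFreeHost(h_in[k]);  cudaFreeHost(h_out[k]);
        cudaStreamDestroy(stream[k]);
    }
    return errors;
}
```

Calling `runSwitchDemo()` from `main` should return 0 if each stream's kernel read only its own buffer. The runtime switch costs a branch per fetch, which is the price for keeping a single kernel; error checking of the CUDA calls is omitted for brevity.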


1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 7:42 am

    Indeed, you cannot unbind or rebind the texture while a kernel in a different stream is still using it.

    Since the number of streams doesn’t need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:

    texture<float, 1, cudaReadModeElementType> mytexture1;
    texture<float, 1, cudaReadModeElementType> mytexture2;
    
    template<int TexSel> __device__ float myTex1Dfetch(int x);
    
    template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
    template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }
    
    
    template<int TexSel> __global__ void mykernel(float *pOut)
    {
        pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
    }
    
    
    int main(void)
    {
        float *out_d[2];
    
        // ...
    
        // the third launch parameter is the dynamic shared memory size; the stream is the fourth
        mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
        mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
    
        // ...
    }
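To show how the template approach might fit into the round-robin loop over data blocks from the question, here is a minimal, hedged sketch. It assumes the legacy texture reference API is available (it was removed in CUDA 12), two streams, and pinned host buffers; `runTemplateDemo`, the block count, the block size, and the doubling done in the kernel are illustrative assumptions, not part of the original code. The two textures are bound once, outside the loop, and in-stream ordering keeps each copy ahead of the kernel that reads it.

```cuda
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;

template<int TexSel> __device__ float myTex1Dfetch(int x);
template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }

// Illustrative kernel: doubles each value fetched through the selected texture.
template<int TexSel> __global__ void mykernel(float *pOut, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pOut[i] = 2.0f * myTex1Dfetch<TexSel>(i);
}

// Hypothetical driver: returns the number of mismatching elements (0 = OK).
int runTemplateDemo()
{
    int const n = 1024, n_datablocks = 6;      // assumed sizes
    size_t const bytes = n * sizeof(float);
    float *h_src, *h_dst, *d_tex[2], *d_out[2];
    cudaStream_t stream[2];

    cudaMallocHost(&h_src, n_datablocks * bytes);  // pinned for async copies
    cudaMallocHost(&h_dst, n_datablocks * bytes);
    for (int i = 0; i < n_datablocks * n; i++)
        h_src[i] = (float)i;
    for (int k = 0; k < 2; k++) {
        cudaStreamCreate(&stream[k]);
        cudaMalloc(&d_tex[k], bytes);
        cudaMalloc(&d_out[k], bytes);
    }

    // One texture per stream, bound once; no rebinding inside the loop.
    cudaBindTexture(0, mytexture1, d_tex[0], bytes);
    cudaBindTexture(0, mytexture2, d_tex[1], bytes);

    for (int b = 0; b < n_datablocks; b++) {
        int const k = b % 2;
        cudaMemcpyAsync(d_tex[k], h_src + b * n, bytes,
                        cudaMemcpyHostToDevice, stream[k]);
        // The template argument must be a compile-time constant, so dispatch by hand.
        if (k == 0) mykernel<1><<<(n + 255) / 256, 256, 0, stream[k]>>>(d_out[k], n);
        else        mykernel<2><<<(n + 255) / 256, 256, 0, stream[k]>>>(d_out[k], n);
        cudaMemcpyAsync(h_dst + b * n, d_out[k], bytes,
                        cudaMemcpyDeviceToHost, stream[k]);
    }
    cudaDeviceSynchronize();

    int errors = 0;
    for (int i = 0; i < n_datablocks * n; i++)
        if (h_dst[i] != 2.0f * h_src[i]) errors++;

    cudaUnbindTexture(mytexture1);
    cudaUnbindTexture(mytexture2);
    for (int k = 0; k < 2; k++) {
        cudaFree(d_tex[k]);
        cudaFree(d_out[k]);
        cudaStreamDestroy(stream[k]);
    }
    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    return errors;
}
```

Because every stream owns its device buffers and its texture, the copy for block `b + 2` simply queues behind the kernel for block `b` in the same stream. Error checking of the CUDA calls is left out to keep the sketch short.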
    