The Archive Base

Editorial Team
Asked: June 8, 2026

Cuda is awesome and I’m using it like crazy but I am not using

CUDA is awesome and I’m using it like crazy, but I’m not using its full potential because I’m having an issue transferring memory, and I was wondering if there is a better way to get a variable amount of memory out. Basically, I send a 65535-item array into CUDA, and CUDA analyzes each data item around 20,000 different ways. If there’s a match in my program’s logic, it saves a 30-int list as a result. Think of my logic as analyzing each different combination and then looking at the total; if the total is equal to a number I’m looking for, it saves the results (a 30-int list for each analyzed item).

The problem is that 65535 (blocks/items in the data array) * 20000 (total combinations tested per item) = 1,310,700,000. This means I need to create an array of that size to deal with the chance that all the data will be a positive match, which is extremely unlikely. Creating int output[1310700000][30] seems crazy for memory (roughly 157 GB with 4-byte ints). I’ve been forced to make it smaller and send fewer blocks to process because I don’t know whether CUDA can write efficiently to a linked list or a dynamically sized list (with my current approach it writes the output to host memory using block * number_of_different_way_tests).

Is there a better way to do this? Can CUDA somehow write to free memory at a location that is not derived from the block ID? When I test this process on the CPU, less than 10% of the items in the array have a positive match, so it’s extremely unlikely I’ll use that much memory each time I send work to the kernel.

P.S. Looking at the above, it’s exactly what I’m doing, but if it’s confusing then here is another way of thinking about it (not exactly what I’m doing, but good enough to understand the problem): I am sending 20,000 arrays (each containing 65,535 items) and adding each item to its peer in the other arrays, and if the total equals a number (say 200-210) then I want to know the numbers that were added to get that matching result.

If the numbers range very widely then not all will match, but with my approach I’m forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I’m forced to run fewer blocks, which isn’t efficient (I want to run as many blocks and threads at a time as possible because I like the way CUDA organizes and runs the blocks). Are there any CUDA or C tricks I can use for this, or am I stuck with mallocing the maximum possible results (and buying a lot more memory)?

1 Answer

    Editorial Team
    Added an answer on June 8, 2026 at 6:13 am

    As per Roger Dahl’s great answer:
    The functionality you’re looking for is called stream compaction.

    You probably do need to provide an array that contains room for 4 solutions per thread, because attempting to store the results directly in compact form is likely to create so many dependencies between threads that the performance gained by copying less data back to the host is lost to a longer kernel execution time. The exception is when almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into a shared output array. I think it would be safe to use atomicAdd() for this: before storing a result, the thread calls atomicAdd() to increase the index by one. atomicAdd() returns the old value, so the thread can use that old value as the index at which to store its result.
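    The atomic-index idea can be sketched roughly as follows. This is a minimal illustration, not the asker's actual kernel: the names findMatches, runFindMatches and isMatch, the match test itself, and the 128-thread block size are all assumptions made up for the example. The output buffer holds only `capacity` results rather than one slot per tested combination, and the counter keeps increasing past capacity so the host can tell when the buffer was too small.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int RESULT_LEN = 30;  // ints saved per match, as in the question

// Hypothetical match test standing in for the real per-combination logic.
__device__ bool isMatch(int item, int combo, int target) {
    return item + combo == target;
}

// One block per item; the threads of a block stride over the combinations.
__global__ void findMatches(const int* items, int numItems, int numCombos,
                            int target, int* results, unsigned int capacity,
                            unsigned int* count) {
    int item = blockIdx.x;
    if (item >= numItems) return;
    for (int combo = threadIdx.x; combo < numCombos; combo += blockDim.x) {
        if (!isMatch(items[item], combo, target)) continue;
        // Reserve a slot: atomicAdd returns the value before the increment.
        unsigned int slot = atomicAdd(count, 1u);
        if (slot >= capacity) continue;  // buffer full: counted, not stored
        int* out = results + (size_t)slot * RESULT_LEN;
        out[0] = item;   // placeholder payload: item index and combination
        out[1] = combo;
        for (int k = 2; k < RESULT_LEN; ++k) out[k] = 0;
    }
}

// Returns the number of matches found (may exceed capacity). Stored results
// are copied into `out` in no particular order. Error checks are omitted.
unsigned int runFindMatches(const std::vector<int>& items, int numCombos,
                            int target, unsigned int capacity,
                            std::vector<int>& out) {
    assert(capacity > 0);
    out.clear();
    if (items.empty()) return 0;

    int* dItems = nullptr;
    int* dResults = nullptr;
    unsigned int* dCount = nullptr;
    cudaMalloc(&dItems, items.size() * sizeof(int));
    cudaMalloc(&dResults, (size_t)capacity * RESULT_LEN * sizeof(int));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemcpy(dItems, items.data(), items.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));

    findMatches<<<(unsigned int)items.size(), 128>>>(
        dItems, (int)items.size(), numCombos, target, dResults, capacity,
        dCount);

    unsigned int found = 0;
    cudaMemcpy(&found, dCount, sizeof(found), cudaMemcpyDeviceToHost);
    unsigned int stored = std::min(found, capacity);
    out.resize((size_t)stored * RESULT_LEN);
    cudaMemcpy(out.data(), dResults, out.size() * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(dItems);
    cudaFree(dResults);
    cudaFree(dCount);
    return found;
}
```

    If the returned count is larger than the capacity, the host could rerun the batch with a bigger buffer or split it into smaller batches. Every match contends for the same counter, so this is only likely to pay off when matches are rare, as the answer notes.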

    However, in the more common situation where there is a fair number of results, the best solution is to perform the compaction as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
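    A separate compaction step with thrust::copy_if might look like the sketch below. It assumes the kernel has already written one total per tested candidate into a dense array; the names compactMatches and InRange and the inclusive range test are assumptions for illustration only. The stencil overload of copy_if is used so that the output is the list of candidate indices whose total falls in the wanted range.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <cassert>
#include <vector>

// Predicate applied to each total: true when it lies in [lo, hi].
struct InRange {
    int lo, hi;
    __host__ __device__ bool operator()(int total) const {
        return total >= lo && total <= hi;
    }
};

// Returns the indices of all totals inside [lo, hi], in ascending order.
// In a real pipeline the totals would already be on the device; they are
// uploaded here only to keep the example self-contained.
std::vector<int> compactMatches(const std::vector<int>& totals, int lo, int hi) {
    assert(lo <= hi);
    thrust::device_vector<int> dTotals(totals.begin(), totals.end());
    thrust::device_vector<int> dIndices(totals.size());

    // Candidate indices 0, 1, 2, ... generated on the fly, not stored.
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + (int)totals.size();

    // Keep index i whenever the stencil value dTotals[i] passes the test.
    auto end = thrust::copy_if(first, last, dTotals.begin(),
                               dIndices.begin(), InRange{lo, hi});

    std::vector<int> out(end - dIndices.begin());
    thrust::copy(dIndices.begin(), end, out.begin());  // device to host
    return out;
}
```

    Only the compacted indices are copied back to the host. Each index identifies one (item, combination) pair, so the 30-int result for it could be recomputed or gathered afterwards. The dense totals array still needs one entry per candidate on the device, so a very large job would probably have to be processed in batches.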
