Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6225289
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 24, 20262026-05-24T08:49:00+00:00 2026-05-24T08:49:00+00:00

I have a CUDA program that seems to be hitting some sort of limit

  • 0

I have a CUDA program that seems to be hitting some sort of limit of some resource, but I can’t figure out what that resource is. Here is the kernel function:

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, 
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // if not a different segment within out polyline and
            // not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex && 
                currentPolylineIndex >= 0);      

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            result = result | (isLegit && !isParallel && 
                intersectionPoint.x > min(p1.x, p2.x) && 
                intersectionPoint.x > min(p3.x, p4.x) && 
                intersectionPoint.x < max(p1.x, p2.x) && 
                intersectionPoint.x < max(p3.x, p4.x) && 
                intersectionPoint.y > min(p1.y, p2.y) && 
                intersectionPoint.y > min(p3.y, p4.y) && 
                intersectionPoint.y < max(p1.y, p2.y) && 
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}

Here is the call to execute the kernel function:

DoCheck<<<702, 32>>>(
    (float2*)devicePoints, 
    deviceSegmentsToPolylineIndexMap, 
    numSegments, 
    deviceOutput);

The sizes of the parameters are as follows:

  • devicePoints = 22,464 float2s = 179,712 bytes
  • deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
  • numSegments = 1 int = 4 bytes
  • deviceOutput = 22,463 ints = 89,852 bytes

When I execute this kernel, it crashes the video card. It would appear that I am hitting some sort of limit, because if I execute the kernel using DoCheck<<<300, 32>>>(...);, it works. Just to be clear, the parameters are the same, just the number of blocks is different.

Any idea why one crashes the video driver, and the other doesn’t? The one that fail seems to be still within the card’s limit on number of blocks.

Update
More information on my system configuration:

  • Video Card: nVidia 8800GT
  • CUDA Version: 1.1
  • OS: Windows Server 2008 R2

I also tried it on a laptop with the following configuration, but got the same results:

  • Video Card: nVidia Quadro FX 880M
  • CUDA Version: 1.2
  • OS: Windows 7 64-bit
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-24T08:49:01+00:00Added an answer on May 24, 2026 at 8:49 am

    The resource which is being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer which will kill any kernel which takes more than a few seconds to execute. Running code on a card which is running a display is subject to this limit.

    On the WDDM Windows platforms you are using, there are three possible solutions/work-arounds:

    1. Get a Telsa card and use the TCC driver, which eliminates the problem completely
    2. Try modifying registry settings to increase the timer limit (google for the TdrDelay registry key for more information, but I am not a Windows user and can’t be more specific than that)
    3. Modify your kernel code to be “re-entrant” and process the data parallel work load in several kernel launches rather than one. Kernel launch overhead isn’t all that large and processing the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using.
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

We have some nightly build machines that have the cuda libraries installed, but which
I'm having some trouble with a very basic CUDA program. I have a program
I converted a program from IDL into CUDA that performs some calculations on a
I have a re-useable function in some CUDA code that needs to be called
hi i have a cuda program which run successfully here is code for cuda
I just have a question about my cuda program that I wrote. It allows
I have a CUDA program that calls the kernel repeatedly within a for loop.
I have a data object that looks like this: { 'node-16': { 'tags': ['cuda'],
I have two large vectors, I am trying to do some sort of element
I have written a CUDA code to solve an NP-Complete problem, but the performance

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.