The Archive Base

Editorial Team
Asked: June 14, 2026

I am creating a test program that will create a device and a host

I am creating a test program that creates a device array and a host array of size n, then launches a kernel with n threads, each of which writes the constant value 0.95f to one location in the device array. After the kernel completes, the device array is copied to the host array, all entries are summed, and the final total is displayed.

The program below works for array sizes up to around 60 million floats and returns the correct result very quickly, but at 70 million it seems to hang for a while and eventually returns NaN for the total. Inspecting the host array after a 60 million run shows it is correctly populated with 0.95f, but after a 70 million run it is populated with NaN. As far as I am aware, none of the CUDA calls return errors.

I am using a 2GB GT640m (Compute 3.0), giving me a max block size of 1024 and a max grid dimension of 2147483647.

I am sure there are better ways of achieving something similar, and I would like to hear suggestions. But I would also like to understand what has gone wrong here so I can learn from it.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <fstream>

void cudaErrorHandler(cudaError_t status)
{
    // Cuda call returned an error, just print error for now
    if(status != cudaSuccess)
    {
        printf("Error");
    }
}

__global__ void addKernel(float* _Results, int _TotalCombinations)
{
    // Get thread Id
    unsigned int Id = (blockDim.x * blockDim.y * blockIdx.x) + (blockDim.x * threadIdx.y) + threadIdx.x;

    //If the Id is within simulation range, log it
    if(Id < _TotalCombinations)
    {
        _Results[Id] = 0.95f;
    }
}

#define BLOCK_DIM_X 32
#define BLOCK_DIM_Y 32
#define BLOCK_SIZE BLOCK_DIM_X * BLOCK_DIM_Y // Static block size of 32*32 (1024)
#define CUDA_CALL(x) cudaErrorHandler(x)

int main()
{
    // The number of simulations to run
    unsigned int totalCombinations = 45000000;

    int gridsize = 1;

    // Work out how many blocks of size 1024 are required to perform all of totalCombinations
    for(unsigned int totalsize = gridsize * BLOCK_SIZE; totalsize < totalCombinations; 
        gridsize++, totalsize = gridsize * BLOCK_SIZE)
        ;

    // Allocate host memory
    float* host_results = new float[totalCombinations];
    memset(host_results, 0, sizeof(float) * totalCombinations);
    float *dev_results = 0;

    cudaSetDevice(0);

    // Allocate device memory
    CUDA_CALL(cudaMalloc((void**)&dev_results, totalCombinations * sizeof(float)));

    dim3 grid, block;

    block = dim3(BLOCK_DIM_X, BLOCK_DIM_Y);

    grid = dim3(gridsize);

    // Launch kernel
    addKernel<<<gridsize, block>>>(dev_results, totalCombinations);

    // Wait for synchronize
    CUDA_CALL(cudaDeviceSynchronize());

    // Copy device data back to host
    CUDA_CALL(cudaMemcpy(host_results, dev_results, totalCombinations * sizeof(float), cudaMemcpyDeviceToHost));

    double total = 0.0;

    // Total the results in the host array
    for(unsigned int i = 0; i < totalCombinations; i++)
        total+=host_results[i];

    // Print results to screen
    printf("Total %f\n", total);

    delete[] host_results;

    return 0;
}

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 7:41 pm

    As you’ve discovered, your error-handling method is not working. A kernel launch does not return a status code, so CUDA_CALL never sees a launch failure; the error has to be read with cudaGetLastError() right after the launch. At your failure point the launch fails because gridsize (you are launching a 1D grid) exceeds the maximum grid size in the x dimension, which is 65535 by default, i.e. when compiling for compute capability up to 2.x. Since the kernel never runs, the device array is never written, and the copy returns whatever was in the uninitialized device memory (NaN in your case). To use the larger x-dimension limit of compute capability 3.0 (2^31 - 1), compile with the -arch=sm_30 switch.

    For reference, here is a version of your code that uses an error-checking macro I use frequently.

    #include <stdio.h>
    #include <stdlib.h>   // exit
    #include <string.h>   // memset
    
    
    #define cudaCheckErrors(msg) \
        do { \
            cudaError_t __err = cudaGetLastError(); \
            if (__err != cudaSuccess) { \
                fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), \
                    __FILE__, __LINE__); \
                fprintf(stderr, "*** FAILED - ABORTING\n"); \
                exit(1); \
            } \
        } while (0)
    
    __global__ void addKernel(float* _Results, int _TotalCombinations)
    {
        // Get thread Id
        unsigned int Id = (blockDim.x * blockDim.y * blockIdx.x) + (blockDim.x * threadIdx.y) + threadIdx.x;
    
        //If the Id is within simulation range, log it
        if(Id < _TotalCombinations)
        {
            _Results[Id] = 0.95f;
        }
    }
    
    #define BLOCK_DIM_X 32
    #define BLOCK_DIM_Y 32
    #define BLOCK_SIZE (BLOCK_DIM_X * BLOCK_DIM_Y) // Static block size of 32*32 (1024)
    
    int main()
    {
        // The number of simulations to run
        unsigned int totalCombinations = 65000000;
    
        int gridsize = 1;
    
        // Work out how many blocks of size 1024 are required to perform all of totalCombinations
        for(unsigned int totalsize = gridsize * BLOCK_SIZE; totalsize < totalCombinations;
            gridsize++, totalsize = gridsize * BLOCK_SIZE)
            ;
        printf("gridsize = %d, blocksize = %d\n", gridsize, BLOCK_SIZE);
        // Allocate host memory
        float* host_results = new float[totalCombinations];
        memset(host_results, 0, sizeof(float) * totalCombinations);
        float *dev_results = 0;
    
        cudaSetDevice(0);
    
        // Allocate device memory
        cudaMalloc((void**)&dev_results, totalCombinations * sizeof(float));
        cudaCheckErrors("cudaMalloc fail");
    
        dim3 grid, block;
    
        block = dim3(BLOCK_DIM_X, BLOCK_DIM_Y);
    
        grid = dim3(gridsize);
    
        // Launch kernel
        addKernel<<<gridsize, block>>>(dev_results, totalCombinations);
        cudaCheckErrors("kernel fail");
        // Wait for synchronize
        cudaDeviceSynchronize();
        cudaCheckErrors("sync fail");
    
        // Copy device data back to host
        cudaMemcpy(host_results, dev_results, totalCombinations * sizeof(float), cudaMemcpyDeviceToHost);
        cudaCheckErrors("cudaMemcpy 2 fail");
    
        double total = 0.0;
    
        // Total the results in the host array
        for(unsigned int i = 0; i < totalCombinations; i++)
            total+=host_results[i];
    
        // Print results to screen
        printf("Total %f\n", total);
    
        // Release device and host memory
        cudaFree(dev_results);
        cudaCheckErrors("cudaFree fail");
        delete[] host_results;
    
        return 0;
    }
    