Before I get into, this is the general idea of what is going on:

Question

0

Editorial Team

Asked: May 25, 20262026-05-25T15:55:58+00:00 2026-05-25T15:55:58+00:00

Before I get into, this is the general idea of what is going on:

0

Before I get into, this is the general idea of what is going on:

The general idea is I have x arrays of floats and I want to add each one sequentially to another array (scalar add):

t = array;

a = array of array;

t = zeros

t += a[0]

t += a[1]

…

t += a[N]

where += represents scalar addition.

This is straight forward. I tried to shrink down the code I have to be as compact as possible and preserve the functionality. The issue here is that for certain size arrays – I’m seeing issues at anything larger than 128 x 128 x 108. Basically the summation of the memory copied back to the host is not the same as what I have calculated it to be. I’ve been stuck on this all day so I’m going to stop wasting my time. I really can’t explain why this is happening. Ive reason through:

Using too much constant space (not using any)
Using too many registers (no)
Incorrect condition in kernel to check if idx,idy,idz are within bounds ( this still could be it)
Something funny with gpu (tried on gt280, and the tesla C1060 and C2060)
Incorrect printf format ( I hope this is it )
*…

That list could go on. Thanks for browsing through this if you have time. The issue almost seems to be memory related (i.e. memory sizes that are > 128*128*108 don’t work. So 64*128*256 will work, or any permutation thereof).

Here Is the full source code (that should be compilable with nvcc):

#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <assert.h>

#define BSIZE 8

void cudaCheckError(cudaError_t e,const char * msg) {
    if (e != cudaSuccess){
        printf("Error number: %d\n",e);
        printf("%s\n",msg);
    }
};

__global__ void accumulate(float * in,float * out, int3 gdims, int zlevel) {

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;
    int idz = threadIdx.z;

    long int index = (zlevel*((int)BSIZE)+idz)*gdims.x*gdims.y+ \
        idy*gdims.x+ \
        idx;

    if ( idx < gdims.x && idy < gdims.y && (idz + zlevel*(int)BSIZE) < gdims.z) {

        out[index] += in[index];
    }
};

int main(int argc, char * argv[]) {

    int width, 
    height,
    depth; 

    if (argc != 4) {
        printf("Must have 3 inputs: width height depth\n");
        exit(0);
    }
    float tempsum;
    int count =0;
    width = atoi(argv[1]);
    height = atoi(argv[2]);
    depth = atoi(argv[3]);

    printf("Dimensions (%d,%d,%d)\n",width,height,depth);

    int3 dFull;

    dFull.x = width+2;
    dFull.y = height+2;
    dFull.z = depth+2;

    printf("Dimensions (%d,%d,%d)\n",dFull.x,dFull.y,dFull.z);

    int fMemSize=dFull.x*dFull.y*dFull.z;

    int nHostF=9;

    float * f_hostZero;

    float ** f_dev;

    float * f_temp_host;
    float * f_temp_dev;

    dim3 grid( dFull.x/(int)BSIZE+1, dFull.y/(int)BSIZE + 1);

    dim3 threads((int)BSIZE,(int)BSIZE,(int)BSIZE);
    printf("Threads (x,y) : (%d,%d)\nGrid (x,y) : (%d,%d)\n",threads.x,threads.y,grid.x,grid.y);

    int num_zsteps=dFull.z/(int)BSIZE + 1;
    printf("Number of z steps to take : %d\n",num_zsteps);
    // Host array allocation
    f_temp_host = new float[fMemSize];
    f_hostZero = new float[fMemSize];


    // Allocate nHostF address on host 
    f_dev = new float*[nHostF];

    // Host array assignment
    for(int i=0; i < fMemSize; i++){
        f_temp_host[i] = 1.0;
        f_hostZero[i] = 0.0;
    }

    // Device allocations - allocated for array size + 2
    for(int i=0; i<nHostF; i++){
        cudaMalloc((void**)&f_dev[i],sizeof(float)*fMemSize);
    }


    // Allocate the decive pointer
    cudaMalloc( (void**)&f_temp_dev, sizeof(float)*fMemSize);

    cudaCheckError(cudaMemcpy((void *)f_temp_dev,(const void *)f_hostZero,
        sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");

    printf("Memory regions allocated\n");

    // Copy memory to each array
    for(int i=0; i<nHostF; i++){
        cudaCheckError(cudaMemcpy((void *)(f_dev[i]),(const void *)f_temp_host,
            sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");
    }

    // Add value 1.0 (from each array n f_dev[i]) to f_temp_dev
    for (int i=0; i<nHostF; i++){
        for (int zLevel=0; zLevel<num_zsteps; zLevel++){
            accumulate<<<grid,threads>>>(f_dev[i],f_temp_dev,dFull,zLevel);
            cudaThreadSynchronize();
        }
        cudaCheckError(cudaMemcpy((void *)f_temp_host,(const void *)f_temp_dev,
            sizeof(float)*fMemSize,cudaMemcpyDeviceToHost),"At mem copy back");
        tempsum=0.f;
        count =0;
        for(int k = 0 ; k< fMemSize; k++){
            tempsum += f_temp_host[k];

            assert ( (int)f_temp_host[k] == (i+1) );
            if ( f_temp_host[k] !=(float)(i+1) ) {
                printf("Found invalid return value\n");
                exit(0);
            }
            count++;
        }
        printf("Total Count: %d\n",count);
        printf("Real Array sum: %18f\nTotal values counted : %d\n",tempsum,count*(i+1));
        printf("Calculated Array sum: %ld\n\n",(i+1)*fMemSize );
    }

    for(int i=0; i<nHostF; i++){
        cudaFree(f_dev[i]);
    }

    cudaFree(f_temp_dev);
    printf("Memory free. Program successfully complete\n");
    delete f_dev;
    delete f_temp_host;
}

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-25T15:55:59+00:00

There is nothing wrong with your device code. All that is happening is that at large problems sizes, you are exhausting the capacity of single precision floating point to exactly calculate the large integer values which the code produces at large run sizes. If you replace your host side summation code with Kahan summation, like this:

    tempsum=0.f;
    count =0;
    float c=0.f;
    for(int k = 0 ; k< fMemSize; k++){
        float y = f_temp_host[k] - c;
        float t = tempsum + y;
        c = (t - tempsum) - y;
        tempsum = t;

        assert ( (int)f_temp_host[k] == (i+1) );
        if ( f_temp_host[k] !=(float)(i+1) ) {
            printf("Found invalid return value\n");
            exit(0);
        }
        count++;
    }

you should find the code runs as expected at larger sizes. Alternatively, the host side summation could be done with double precision arithmetic instead. If you have not already read it, I highly recommend What Every Computer Scientist Should Know About Floating-Point Arithmetic. It will help explain where you went wrong in this example, and the wisdom it imparts might help prevent committing similar faux pas in the future.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Before I get into, this is the general idea of what is going on:

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply