The Archive Base

Editorial Team
Asked: May 25, 2026

Before I get into the details, this is the general idea of what is going on: I have x arrays of floats and I want to add each one sequentially to another array (scalar add):

t = array;
a = array of array;
t = zeros
t += a[0]
t += a[1]
…
t += a[N]

where += represents scalar addition.

This is straightforward. I tried to shrink the code down to be as compact as possible while preserving the functionality. The issue is that for arrays of a certain size (I'm seeing problems at anything larger than 128 x 128 x 108), the summation of the memory copied back to the host is not the same as what I have calculated it to be. I've been stuck on this all day, so I'm going to stop wasting my time. I really can't explain why this is happening. I've reasoned through:

  • Using too much constant space (not using any)
  • Using too many registers (no)
  • Incorrect condition in kernel to check if idx, idy, idz are within bounds (this still could be it)
  • Something funny with the GPU (tried on gt280, and the Tesla C1060 and C2060)
  • Incorrect printf format (I hope this is it)
    *…

That list could go on. Thanks for browsing through this if you have time. The issue almost seems to be memory related (i.e. memory sizes that are > 128*128*108 don’t work. So 64*128*256 will work, or any permutation thereof).

Here is the full source code (it should compile with nvcc):

#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>   // atoi, exit
#include <assert.h>

#define BSIZE 8

void cudaCheckError(cudaError_t e,const char * msg) {
    if (e != cudaSuccess){
        printf("Error number: %d\n",e);
        printf("%s\n",msg);
    }
};

__global__ void accumulate(float * in,float * out, int3 gdims, int zlevel) {

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;
    int idz = threadIdx.z;

    long int index = (zlevel*((int)BSIZE)+idz)*gdims.x*gdims.y+ \
        idy*gdims.x+ \
        idx;

    if ( idx < gdims.x && idy < gdims.y && (idz + zlevel*(int)BSIZE) < gdims.z) {

        out[index] += in[index];
    }
};

int main(int argc, char * argv[]) {

    int width, 
    height,
    depth; 

    if (argc != 4) {
        printf("Must have 3 inputs: width height depth\n");
        exit(0);
    }
    float tempsum;
    int count =0;
    width = atoi(argv[1]);
    height = atoi(argv[2]);
    depth = atoi(argv[3]);

    printf("Dimensions (%d,%d,%d)\n",width,height,depth);

    int3 dFull;

    dFull.x = width+2;
    dFull.y = height+2;
    dFull.z = depth+2;

    printf("Dimensions (%d,%d,%d)\n",dFull.x,dFull.y,dFull.z);

    int fMemSize=dFull.x*dFull.y*dFull.z;

    int nHostF=9;

    float * f_hostZero;

    float ** f_dev;

    float * f_temp_host;
    float * f_temp_dev;

    dim3 grid( dFull.x/(int)BSIZE+1, dFull.y/(int)BSIZE + 1);

    dim3 threads((int)BSIZE,(int)BSIZE,(int)BSIZE);
    printf("Threads (x,y) : (%d,%d)\nGrid (x,y) : (%d,%d)\n",threads.x,threads.y,grid.x,grid.y);

    int num_zsteps=dFull.z/(int)BSIZE + 1;
    printf("Number of z steps to take : %d\n",num_zsteps);
    // Host array allocation
    f_temp_host = new float[fMemSize];
    f_hostZero = new float[fMemSize];


    // Allocate nHostF address on host 
    f_dev = new float*[nHostF];

    // Host array assignment
    for(int i=0; i < fMemSize; i++){
        f_temp_host[i] = 1.0;
        f_hostZero[i] = 0.0;
    }

    // Device allocations - allocated for array size + 2
    for(int i=0; i<nHostF; i++){
        cudaMalloc((void**)&f_dev[i],sizeof(float)*fMemSize);
    }


    // Allocate the device pointer
    cudaMalloc( (void**)&f_temp_dev, sizeof(float)*fMemSize);

    cudaCheckError(cudaMemcpy((void *)f_temp_dev,(const void *)f_hostZero,
        sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");

    printf("Memory regions allocated\n");

    // Copy memory to each array
    for(int i=0; i<nHostF; i++){
        cudaCheckError(cudaMemcpy((void *)(f_dev[i]),(const void *)f_temp_host,
            sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");
    }

    // Add value 1.0 (from each array n f_dev[i]) to f_temp_dev
    for (int i=0; i<nHostF; i++){
        for (int zLevel=0; zLevel<num_zsteps; zLevel++){
            accumulate<<<grid,threads>>>(f_dev[i],f_temp_dev,dFull,zLevel);
            cudaThreadSynchronize();
        }
        cudaCheckError(cudaMemcpy((void *)f_temp_host,(const void *)f_temp_dev,
            sizeof(float)*fMemSize,cudaMemcpyDeviceToHost),"At mem copy back");
        tempsum=0.f;
        count =0;
        for(int k = 0 ; k< fMemSize; k++){
            tempsum += f_temp_host[k];

            assert ( (int)f_temp_host[k] == (i+1) );
            if ( f_temp_host[k] !=(float)(i+1) ) {
                printf("Found invalid return value\n");
                exit(0);
            }
            count++;
        }
        printf("Total Count: %d\n",count);
        printf("Real Array sum: %18f\nTotal values counted : %d\n",tempsum,count*(i+1));
        printf("Calculated Array sum: %ld\n\n",(long)(i+1)*fMemSize );
    }

    for(int i=0; i<nHostF; i++){
        cudaFree(f_dev[i]);
    }

    cudaFree(f_temp_dev);
    printf("Memory free. Program successfully complete\n");
    // Arrays allocated with new[] must be released with delete[]
    delete[] f_dev;
    delete[] f_temp_host;
    delete[] f_hostZero;
}
1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 3:55 pm

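    There is nothing wrong with your device code. At large problem sizes you are simply exhausting the capacity of single precision floating point to represent exactly the large integer values that the host side summation produces. A float has a 24-bit significand, so it cannot represent every integer above 2^24 = 16,777,216. With the two padding cells per dimension, a 128 x 128 x 108 run sums to at most 9 * 130 * 130 * 110 = 16,731,000, which is just under that limit, and slightly larger runs go over it. If you replace your host side summation code with Kahan summation, like this: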

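        tempsum=0.f;
        count =0;
        float c=0.f;                        // running compensation term
        for(int k = 0 ; k< fMemSize; k++){
            float y = f_temp_host[k] - c;   // apply the error carried from the last step
            float t = tempsum + y;          // low-order bits of y can be lost here
            c = (t - tempsum) - y;          // what was actually added minus what was intended
            tempsum = t;

            assert ( (int)f_temp_host[k] == (i+1) );
            if ( f_temp_host[k] !=(float)(i+1) ) {
                printf("Found invalid return value\n");
                exit(0);
            }
            count++;
        }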
    

    you should find the code runs as expected at larger sizes. Alternatively, the host side summation could be done with double precision arithmetic instead. If you have not already read it, I highly recommend What Every Computer Scientist Should Know About Floating-Point Arithmetic. It will help explain where you went wrong in this example, and the wisdom it imparts might help prevent committing similar faux pas in the future.
