Note 4 So the code is finally fixed! Turned out the final problem was

Asked: May 31, 2026

Note 4
So the code is finally fixed! It turned out the final problem was that I was adding the size in bytes of each sub-array to the pointer, but C pointer arithmetic already scales by the size of the pointed-to type. I was therefore advancing 4x as far as I should have been, which is why only the first two elements of the 5-element array would display. The AoSoA is now fully working. Be careful with your memory management if you try something similar; I struggled with a lot of seemingly silly errors because my initial code was sloppy.

Beware:
+ Improper offsets
+ Needless malloc’s
+ Out of range references
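To make the offset pitfall concrete, here is a minimal host-only sketch (plain C++ that should also compile in a .cu file). The helper names elementOffset, overscaledOffset and subArray are hypothetical and do not appear in the code that follows; they only illustrate that an offset added to an int* is counted in elements, not bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not from the post: element offset of the sub-array
// for a 0-based year inside one contiguous slab of years * regions ints.
inline std::size_t elementOffset(std::size_t year, std::size_t regions)
{
    return year * regions;   // counted in elements, not bytes
}

// The mistake: scaling by sizeof(int) again, which the compiler already
// does when the offset is added to an int*.
inline std::size_t overscaledOffset(std::size_t year, std::size_t regions)
{
    return year * regions * sizeof(int);
}

// Start of a year's sub-array; base must point at years * regions ints.
inline int *subArray(int *base, std::size_t year, std::size_t years,
                     std::size_t regions)
{
    assert(year < years);    // catch out-of-range years early
    return base + elementOffset(year, regions);
}
```

With YEARS 5 and REGIONS 20 the slab holds 100 ints. Assuming a 4-byte int, the over-scaled offset for the third year is 160, which is already past the end of the slab.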

Here’s the working example code, results follow!

#include <stdio.h>
#include <stdlib.h>   // malloc, free, exit

#define REGIONS 20
#define YEARS 5

// __FILE__ expands to a string literal, so the file parameter must be const.
inline void gpuAssert(cudaError_t code, const char *file, int line,
                 bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code),
          file, line);
      if (abort) exit(code);
   }
}

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

struct AnimalPopulationForYear_s
{
   bool isYearEven;
   int * rabbits;
   int * hyenas;
};

AnimalPopulationForYear_s * dev_pop;

__global__ void RunSim(AnimalPopulationForYear_s dev_pop[],
               int year)
{
   int idx = blockIdx.x*blockDim.x+threadIdx.x;
   int rabbits, hyenas;
   int arrEl = year-1;

   rabbits = (idx+1) * year * year; 
   hyenas = rabbits / 10;

   if ( rabbits > 100000 ) rabbits = 100000;   
   if ( hyenas < 2 ) hyenas = 2;

   if ( idx < REGIONS ) dev_pop[arrEl].rabbits[idx] = rabbits;
   if ( idx < REGIONS ) dev_pop[arrEl].hyenas[idx] = hyenas;

   if (threadIdx.x == 0 && blockIdx.x == 0)
      // == binds tighter than &, so the mask needs its own parentheses.
      dev_pop[arrEl].isYearEven = ((year & 0x01) == 0x0);
}

int main()
{
   //Various reused sizes...
   const size_t fullArrSz = size_t(YEARS) * size_t(REGIONS) * sizeof(int);
   const size_t structArrSz = size_t(YEARS) * sizeof(AnimalPopulationForYear_s);

   //Vars to hold struct and merged subarray memory inside it.
   AnimalPopulationForYear_s * h_pop;
   int * dev_hyenas, * dev_rabbits, * h_hyenas, * h_rabbits, arrEl;

   //Alloc. memory.
   h_pop = (AnimalPopulationForYear_s *) malloc(structArrSz);
   h_rabbits = (int *) malloc(fullArrSz);
   h_hyenas = (int *) malloc(fullArrSz);
   gpuErrchk(cudaMalloc((void **) &dev_pop,structArrSz));
   gpuErrchk(cudaMalloc((void **) &dev_rabbits,fullArrSz));
   gpuErrchk(cudaMalloc((void **) &dev_hyenas,fullArrSz));

   //Offset ptrs.
   for (int i = 0; i < YEARS; i++)
   {
      h_pop[i].rabbits = dev_rabbits+i*REGIONS;
      h_pop[i].hyenas = dev_hyenas+i*REGIONS;
   }

   //Copy host struct with dev. pointers to device.
   gpuErrchk
      (cudaMemcpy(dev_pop,h_pop, structArrSz, cudaMemcpyHostToDevice));

   //Call kernel
   for(int i=1; i < YEARS+1; i++) RunSim<<<REGIONS/128+1,128>>>(dev_pop,i);

   //Make sure nothing went wrong.
   gpuErrchk(cudaPeekAtLastError());
   gpuErrchk(cudaDeviceSynchronize());

   gpuErrchk(cudaMemcpy(h_pop,dev_pop,structArrSz, cudaMemcpyDeviceToHost));
   gpuErrchk
      (cudaMemcpy(h_rabbits, dev_rabbits,fullArrSz, cudaMemcpyDeviceToHost));
   gpuErrchk(cudaMemcpy(h_hyenas,dev_hyenas,fullArrSz, cudaMemcpyDeviceToHost));

   for(int i=0; i < YEARS; i++)
   {
      h_pop[i].rabbits = h_rabbits + i*REGIONS;
      h_pop[i].hyenas = h_hyenas + i*REGIONS;
   }

   for(int i=1; i < YEARS+1; i++)
   {
      arrEl = i-1;
      printf("\nYear %i\n=============\n\n", i);      
      printf("Rabbits\n-------------\n");
      for (int j=0; j < REGIONS; j++)
         printf("Region: %i  Pop: %i\n", j, h_pop[arrEl].rabbits[j]);
      printf("Hyenas\n-------------\n");
      for (int j=0; j < REGIONS; j++)
         printf("Region: %i  Pop: %i\n", j, h_pop[arrEl].hyenas[j]);
   }

   //Free on device and host
   cudaFree(dev_pop);
   cudaFree(dev_rabbits);
   cudaFree(dev_hyenas);

   free(h_pop);
   free(h_rabbits);
   free(h_hyenas);

   return 0;
}
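A side note on the isYearEven flag in the kernel: in C and C++, == binds more tightly than &, so an even-year test needs parentheses around the mask. The sketch below is host-only and uses hypothetical function names; the second function spells out how the compiler groups the expression when the parentheses are left off.

```cpp
// Mask first, then compare: true for even years.
inline bool isEvenYear(int year)
{
    return (year & 0x01) == 0x0;
}

// How `year & 0x01 == 0x0` is grouped without parentheses:
// (0x01 == 0x0) is false, i.e. 0, and year & 0 is always 0.
inline bool asParsedWithoutParens(int year)
{
    return year & (0x01 == 0x0);
}
```

The unparenthesized form therefore reports false for every year. The flag is never printed by the example, so the output would not reveal the difference either way.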

[Finally] correct results:

Year 1
=============

Rabbits
-------------
Region: 0 Pop: 1
Region: 1 Pop: 2
Region: 2 Pop: 3
Region: 3 Pop: 4
Region: 4 Pop: 5
Region: 5 Pop: 6
Region: 6 Pop: 7
Region: 7 Pop: 8
Region: 8 Pop: 9
Region: 9 Pop: 10
Region: 10 Pop: 11
Region: 11 Pop: 12
Region: 12 Pop: 13
Region: 13 Pop: 14
Region: 14 Pop: 15
Region: 15 Pop: 16
Region: 16 Pop: 17
Region: 17 Pop: 18
Region: 18 Pop: 19
Region: 19 Pop: 20
Hyenas
-------------
Region: 0 Pop: 2
Region: 1 Pop: 2
Region: 2 Pop: 2
Region: 3 Pop: 2
Region: 4 Pop: 2
Region: 5 Pop: 2
Region: 6 Pop: 2
Region: 7 Pop: 2
Region: 8 Pop: 2
Region: 9 Pop: 2
Region: 10 Pop: 2
Region: 11 Pop: 2
Region: 12 Pop: 2
Region: 13 Pop: 2
Region: 14 Pop: 2
Region: 15 Pop: 2
Region: 16 Pop: 2
Region: 17 Pop: 2
Region: 18 Pop: 2
Region: 19 Pop: 2

Year 2
=============

Rabbits
-------------
Region: 0 Pop: 4
Region: 1 Pop: 8
Region: 2 Pop: 12
Region: 3 Pop: 16
Region: 4 Pop: 20
Region: 5 Pop: 24
Region: 6 Pop: 28
Region: 7 Pop: 32
Region: 8 Pop: 36
Region: 9 Pop: 40
Region: 10 Pop: 44
Region: 11 Pop: 48
Region: 12 Pop: 52
Region: 13 Pop: 56
Region: 14 Pop: 60
Region: 15 Pop: 64
Region: 16 Pop: 68
Region: 17 Pop: 72
Region: 18 Pop: 76
Region: 19 Pop: 80
Hyenas
-------------
Region: 0 Pop: 2
Region: 1 Pop: 2
Region: 2 Pop: 2
Region: 3 Pop: 2
Region: 4 Pop: 2
Region: 5 Pop: 2
Region: 6 Pop: 2
Region: 7 Pop: 3
Region: 8 Pop: 3
Region: 9 Pop: 4
Region: 10 Pop: 4
Region: 11 Pop: 4
Region: 12 Pop: 5
Region: 13 Pop: 5
Region: 14 Pop: 6
Region: 15 Pop: 6
Region: 16 Pop: 6
Region: 17 Pop: 7
Region: 18 Pop: 7
Region: 19
…

Note 3:
Following talonmies's advice, I cleaned up multiple array indexing inconsistencies, etc. in my code.

The results now seem to give the correct SoA for the first two spots in the AoSoA (see new output). For some reason the third spot (year 3) onward is now giving wrong results, although there’s no error code from the GPU. I’m going to peek at the pointers (h_pop[year-1].rabbits, h_pop[year-1].hyenas) and see if that reveals anything.

My only advice for anyone else attempting AoSoA: be VERY careful with your indexing and memory allocation. Of course that’s good advice in general, but with all the memory flying around in a complex multi-level data container such as the AoSoA, the chance of error rises sharply if you’re sloppy. Thanks for your patience, talonmies.

Note 2:
So following the advice of talonmies, I fixed my loop numbering, wrapped my CUDA calls with the error checking, and condensed my cudaMemcpy calls by reusing dev_rabbits/dev_hyenas. I also switched the variable names to a lowercase first letter: thinking about djmj’s complaint about casing, I realized that NVIDIA does lowercase the first letter in their constants, so djmj was right in a sense. I should have styled my code like that for consistency, regardless of my personal preferences/experience.

I also generally cleaned up the code, as I wrote it on not much sleep and was kind of horrified at how sloppy it was.

Now I’m running into a new issue, though… my program hangs at the first cudaMemcpy and does not return (hence talonmies’s handy wrapper does not catch anything). I’m not quite sure why this is… I’ve compiled several programs, including much larger/longer-running ones, on the device, and they all work fine.

At this point I’m puzzled. If it’s still not working, might post something in the morning.

Note 1
The first answer seemed to really miss the point. This is just a toy code; it’s not meant to represent a real program. Its sole purpose is to try to set up the memory, write some junk to it and read it back, in order to verify AoSoA is working.

So commenting to me on shared memory, etc. is not going to be productive. That’s not the point of this thread. Of course if this was a real code I’d be eliminating branching in my kernels, using shared memory, aligning my data, using warp level summation, etc. I’ve done all that in past codes and got it working.

This code is toy, proof-of-concept code, nothing more, nothing less, designed to try to get AoSoA working. That is its only purpose; it is not a real code.

As for the casing of the var names, I worked at two different places that used fully cased var names in their coding standard (they used tags, I do _s on structs/typedefs), so it kind of stuck. Sorry you don’t like it. As for the indentation I’ll try to fix that later… Windows and Linux were not playing nicely.

One more note: if you’re confused by the device pointer offsetting, see Anycom’s answer to “Pointers in structs passed to CUDA”.

I wrote the following code to test arrays of structures with arrays inside in CUDA….

Edit: Fixed code — hangs after meh and before hi, presumably on the cudaMemcpy… unsure why!

…Any idea what’s going on here and how to fix it?

Note:
I was worried the cudaFrees might be screwing things up, but removing them did nothing.

1 Answer

Editorial Team · Answer 1 · 2026-05-31T03:22:19+00:00

There is an awful lot wrong with this code, but the basic cause of the “garbled” results you are asking about is that you are looking at uninitialized memory. dev_Pop[0].Rabbits is never set to anything in device memory, so you shouldn’t be too surprised that its contents are “garbled”. The root cause of the problem is this:

for(int i=1; i < YEARS+1; i++)
    RunSim<<<REGIONS/128+1,128>>>(dev_Pop,i);

Here you are starting from year=1, meaning year=0 is never set to anything, and year=YEARS is a guaranteed buffer overflow in device memory.
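The off-by-one described above can be sketched on the host side. This is only an illustration under the assumption that simulation years are numbered from 1 while the array of structures is indexed from 0; yearToSlot and slotInRange are hypothetical helpers, not part of the posted code.

```cpp
#include <cassert>

constexpr int YEARS = 5;   // same value as the #define in the question

// Hypothetical helper: maps a 1-based simulation year to a 0-based slot.
inline int yearToSlot(int year)
{
    assert(year >= 1 && year <= YEARS);   // reject years outside 1..YEARS
    return year - 1;
}

// True when slot is a valid index into an array of YEARS elements.
inline bool slotInRange(int slot)
{
    return slot >= 0 && slot < YEARS;
}
```

Indexing with the raw year value would leave slot 0 unwritten and touch slot YEARS, which slotInRange rejects; indexing with yearToSlot(year) covers exactly slots 0 through YEARS-1.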

Later in the copy back code, you do this at every iteration:

cudaFree(h_Pop[i].Rabbits);
cudaFree(h_Pop[i].Hyenas);

but you never malloced them in the first place, so the copy back operation will probably fail as well. How it will fail is hard to say without compiling and running the code, but I would guess that the CUDA runtime will completely free dev_Rabbits and dev_Hyenas on the first call. That should make subsequent cudaMemcpy calls in the loop fail. Irrespective of the precise mechanics, I would be incredibly surprised if your copy back loop successfully gets all the data back to the host. A much more sane implementation would be a work-alike of the code you used to construct the device memory image in the first place, something like:

const size_t dsize = size_t(YEARS) * size_t(REGIONS) * sizeof(int);
int * Rabbits = (int *) malloc(dsize);
int * Hyenas = (int *) malloc(dsize);
cudaMemcpy(Rabbits, dev_Rabbits, dsize, cudaMemcpyDeviceToHost);
cudaMemcpy(Hyenas, dev_Hyenas, dsize, cudaMemcpyDeviceToHost);

for(int i=0; i < YEARS; i++)
{
    h_Pop[i].Rabbits = Rabbits + i*REGIONS;
    h_Pop[i].Hyenas = Hyenas + i*REGIONS;
}

Doing it this way gets rid of a lot of redundant device->host transactions over the PCI-e bus, and all those unnecessary host side malloc calls within the loop.

So I would guess that there are multiple points of runtime failure happening in the code, but because you have neglected to include any error checking, things are silently failing and you just don’t notice. To fix that add something like this in your code:

inline void gpuAssert(cudaError_t code, const char * file, int line, bool Abort=true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file,line);
        if (Abort) exit(code);
    }       
}

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

and then use gpuErrchk to test the return status of every API call, for example:

gpuErrchk(cudaMalloc((void **) &dev_Pop,YEARS*sizeof(AnimalPopulationForYear_s)));

For your kernel launches, I recommend doing this:

RunSim<<<REGIONS/128+1,128>>>(dev_Pop,i);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());

That will trap both illegal arguments and resource exhaustion which would cause a launch failure, and any execution errors which would cause a kernel abort. Armed with this error checking, I suspect you will find a lot of holes to fix before the code actually runs to completion….

EDIT:

It seems you have decided to invent new and unusual ways for your revised code to not work – including breaking the very thing that you had correct in your original code and seemed to be the subject of your question – the construction of a device memory array of structures.

Here is a slightly simplified and working version of your second code. All I can suggest is study it until you see why it works where your current version fails.

#include <cstdio>
#include <cstdlib>

#define REGIONS 20
#define YEARS 5
#define POPMIN 2
#define POPMAX 100000

inline void gpuAssert(cudaError_t code, const char *file, int line, 
                 bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code),
          file, line);
      if (abort) exit(code);
   }
}

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

struct Population_s
{
   int * rabbits;
   int * hyenas;
};

__global__ void RunSim(Population_s * dev_pop, int year)
{
   int idx = blockIdx.x*blockDim.x+threadIdx.x;

   if (idx < REGIONS) {
      int rabbits, hyenas;

      rabbits = min(POPMAX, idx * year * year); 
      hyenas = max(POPMIN, rabbits / 10);

      dev_pop[year-1].rabbits[idx] = rabbits;
      dev_pop[year-1].hyenas[idx] = hyenas;
   }
}

int main()
{
   const size_t subArrSz = size_t(REGIONS) * sizeof(int);
   const size_t fullArrSz = size_t(YEARS) * subArrSz;
   const size_t structArrSz = size_t(YEARS) * sizeof(Population_s);

   Population_s * h_pop = (Population_s *) malloc(structArrSz);
   int * h_rabbits = (int *) malloc(fullArrSz);
   int * h_hyenas = (int *) malloc(fullArrSz);

   Population_s * dev_pop;
   int * dev_hyenas, * dev_rabbits;

   gpuErrchk(cudaMalloc((void **) &dev_pop,structArrSz));
   gpuErrchk(cudaMalloc((void **) &dev_hyenas,fullArrSz));
   gpuErrchk(cudaMalloc((void **) &dev_rabbits,fullArrSz));

   gpuErrchk(cudaMemset(dev_rabbits, 1, fullArrSz));
   gpuErrchk(cudaMemset(dev_hyenas, 1, fullArrSz));

   for (int i = 0; i < YEARS; i++)
   {
      h_pop[i].rabbits = dev_rabbits + i*REGIONS;
      h_pop[i].hyenas = dev_hyenas + i*REGIONS;
   }

   gpuErrchk
      (cudaMemcpy(dev_pop,h_pop, structArrSz, cudaMemcpyHostToDevice));

   for(int i = 1; i < (YEARS+1); i++) {
       RunSim<<<REGIONS/128+1,128>>>(dev_pop,i);
       gpuErrchk(cudaPeekAtLastError());
       gpuErrchk(cudaDeviceSynchronize());
   }

   gpuErrchk(cudaMemcpy(h_rabbits, dev_rabbits, fullArrSz, cudaMemcpyDeviceToHost));
   gpuErrchk(cudaMemcpy(h_hyenas, dev_hyenas, fullArrSz, cudaMemcpyDeviceToHost));

   for(int i=0; i < YEARS; i++)
   {
      h_pop[i].rabbits = h_rabbits + i*REGIONS;
      h_pop[i].hyenas = h_hyenas + i*REGIONS;
   }

   for(int i=0; i < YEARS; i++)
   {
      printf("\n=============\n");   
      printf("Year %i\n=============\n\n", i+1);   
      printf("Rabbits\n-------------\n");
      for (int j=0; j < REGIONS; j++)
         printf("Region: %i  Pop: %i\n", j, h_pop[i].rabbits[j]);
      printf("\nHyenas\n-------------\n");
      for (int j=0; j < REGIONS; j++)
         printf("Region: %i  Pop: %i\n", j, h_pop[i].hyenas[j]);
   }

   cudaFree(dev_pop);
   cudaFree(dev_rabbits);
   cudaFree(dev_hyenas);

   free(h_pop);
   free(h_rabbits);
   free(h_hyenas);

   return 0;
}

As a final tip – don’t use anything from the SDK cutil library in your own code, that isn’t what it is intended for. It isn’t an official part of CUDA, doesn’t have documentation, isn’t considered production ready, and isn’t guaranteed to either work, be the same, or even exist in any given release of the CUDA SDK.
