I ported this piece of code: if(_layersCount > 1) { for(int i=_layersCount-2;i>=0;i--) { for(int

Question

Asked: June 14, 2026 (2026-06-14T12:38:17+00:00)

I ported this piece of code:

    if(_layersCount > 1)
    {
        for(int i=_layersCount-2;i>=0;i--)
        {
            for(int j=0;j<_neuronsPerLayerCount[i];j++) // cuda kernel
            {
                localGradients[indexByLayerAndNeuron(i, j)] = 0;

                for(int k=0;k<_neuronsPerLayerCount[i+1];k++)
                {
                    localGradients[indexByLayerAndNeuron(i, j)] += _neuronsInputsWeights[indexByLayerNeuronAndInput(i+1, k, j)]
                                                                    * localGradients[indexByLayerAndNeuron(i+1, k)];
                }

                localGradients[indexByLayerAndNeuron(i, j)] *= derivatives[indexByLayerAndNeuron(i, j)];
            }
        }
    }

to CUDA:

    if(_layersCount > 1)
    {
        for(int i=_layersCount-2;i>=0;i--)
        {
            // calculateLocalGradientsForAnotherLayers
            blocksCount = floor((double) _neuronsPerLayerCount[i] / threads.x) + 1;
            blocks = dim3(blocksCount, 1);

            calculateLocalGradientsForAnotherLayers <<<blocks, threads>>> (deviceLocalGradients, _neuronsInputsWeights, deviceDerivatives, _neuronsPerLayerCount[i], _neuronsInPreviousLayers[i], _neuronsInPreviousLayers[i+1], _neuronsPerLayerCount[i+1], _inputsInPreviousLayers[i], _inputsInCurrentLayer[i]);
        }
    }

The calculateLocalGradientsForAnotherLayers kernel:

    __global__ void calculateLocalGradientsForAnotherLayers(double * localGradients, double * neuronsInputsWeights, double * derivatives, int neuronsCount, int neuronsInPreviousLayers, int neuronsInPreviousLayersWithCurrent, int neuronsInNextLayer, int inputsInPreviousLayers, int inputsInCurrentLayer)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if(idx < neuronsCount)
        {
            int neuron = neuronsInPreviousLayers + idx;

            localGradients[neuron] = 0;

            // this to Kernel, then reduce localGradients.
            for(int k=0;k<neuronsInNextLayer;k++)
            {
                localGradients[neuron] += neuronsInputsWeights[inputsInPreviousLayers + k*inputsInCurrentLayer + idx]
                                          * localGradients[neuronsInPreviousLayersWithCurrent + k];
            }

            localGradients[neuron] *= derivatives[neuron];
        }
    }

However, the results differ from the CPU version starting at the second decimal place. Why is the error so large? All my other kernels work correctly; only this one is off.

My GPU is NV GF555M. It supports double precision.

1 Answer

Editorial Team · Answer 1 · 2026-06-14T12:38:19+00:00

In the body of your kernel, you need some kind of inter-block synchronization over the localGradients array:

    for(int k=0;k<neuronsInNextLayer;k++)
    {
        localGradients[neuron] += neuronsInputsWeights[inputsInPreviousLayers + k*inputsInCurrentLayer + idx]
                                  * localGradients[neuronsInPreviousLayersWithCurrent + k];
    }

Concurrent reads and writes can corrupt the values stored in localGradients. Because nothing orders these accesses, the results you see may vary from run to run.
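As a minimal sketch of one way to rule out the read/write hazard described above, the kernel below reads the next layer's gradients through a read-only pointer, accumulates the sum in a register, and writes each result to global memory exactly once. It is not the asker's kernel: the names `backpropagateLayer` and `backpropagateLayerOnDevice` are hypothetical, the separate per-layer buffers are an assumption, and the weights are assumed to be stored row-major as `[nextCount x currentCount]`, which may not match the original index helpers.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one thread per neuron j of the current layer.
// nextGradients is only read and currentGradients is only written, so no
// thread reads a location that another thread in this launch writes.
__global__ void backpropagateLayer(const double* nextGradients,
                                   const double* weights,      // assumed layout: weights[k * currentCount + j]
                                   const double* derivatives,
                                   double* currentGradients,
                                   int currentCount, int nextCount)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= currentCount) return;

    double sum = 0.0;  // accumulate in a register, not in global memory
    for (int k = 0; k < nextCount; ++k)
        sum += weights[k * currentCount + j] * nextGradients[k];

    currentGradients[j] = sum * derivatives[j];  // single global write
}

// Hypothetical host helper: runs one backward step for a single layer
// and returns the gradients of the current layer. Error checking is omitted.
std::vector<double> backpropagateLayerOnDevice(const std::vector<double>& nextGradients,
                                               const std::vector<double>& weights,
                                               const std::vector<double>& derivatives)
{
    int nextCount = static_cast<int>(nextGradients.size());
    int currentCount = static_cast<int>(derivatives.size());

    double *dNext = nullptr, *dW = nullptr, *dDer = nullptr, *dCur = nullptr;
    cudaMalloc((void**)&dNext, nextCount * sizeof(double));
    cudaMalloc((void**)&dW, weights.size() * sizeof(double));
    cudaMalloc((void**)&dDer, currentCount * sizeof(double));
    cudaMalloc((void**)&dCur, currentCount * sizeof(double));

    // Every buffer the kernel dereferences must live in device memory.
    cudaMemcpy(dNext, nextGradients.data(), nextCount * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, weights.data(), weights.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dDer, derivatives.data(), currentCount * sizeof(double), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (currentCount + threads - 1) / threads;  // round up
    backpropagateLayer<<<blocks, threads>>>(dNext, dW, dDer, dCur, currentCount, nextCount);

    // This blocking copy runs after the kernel has finished.
    std::vector<double> result(currentCount);
    cudaMemcpy(result.data(), dCur, currentCount * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(dNext); cudaFree(dW); cudaFree(dDer); cudaFree(dCur);
    return result;
}
```

When the layers are processed in a loop, as in the question, kernels launched into the same stream execute in launch order, so each launch sees the finished gradients of the layer after it. Comparing the output of a helper like this against the CPU loop for a single layer is one way to narrow down where the two versions start to diverge.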
