Editorial Team
Asked: May 28, 2026

I’ve been going through a few examples, reducing an array of elements to one

I’ve been going through a few examples of reducing an array of elements to a single element, without success. Someone posted the kernel below on an NVIDIA forum; I have changed it from floating-point variables to integers.

__kernel void sum(__global const short *A,__global unsigned long  *C,uint size, __local unsigned long *L) {
            unsigned long sum=0;
            for(int i=get_local_id(0);i<size;i+=get_local_size(0))
                    sum+=A[i];
            L[get_local_id(0)]=sum;

            for(uint c=get_local_size(0)/2;c>0;c/=2)
            {
                    barrier(CLK_LOCAL_MEM_FENCE);
                    if(c>get_local_id(0))
                            L[get_local_id(0)]+=L[get_local_id(0)+c];

            }
            if(get_local_id(0)==0)
                    C[0]=L[0];
            barrier(CLK_LOCAL_MEM_FENCE);
}

Does this look right? Is the third argument, “size”, supposed to be the local work size or the global work size?

I set up my arguments like this:

clSetKernelArg(ocReduce, 0, sizeof(cl_mem), (void*) &DevA);
clSetKernelArg(ocReduce, 1, sizeof(cl_mem), (void*) &DevC); 
clSetKernelArg(ocReduce, 2, sizeof(uint),   (void*) &size);  
clSetKernelArg(ocReduce, 3, LocalWorkSize * sizeof(unsigned long), NULL); 

The first argument, the input, is the output buffer of the kernel launched before it, which I am trying to retain:

clRetainMemObject(DevA);
clEnqueueNDRangeKernel(hCmdQueue[Plat-1][Dev-1], ocKernel, 1, NULL, &GlobalWorkSize, &LocalWorkSize, 0, NULL, NULL);
//the device memory object DevA now has the data to be reduced

clEnqueueNDRangeKernel(hCmdQueue[Plat-1][Dev-1], ocReduce, 1, NULL, &GlobalWorkSize, &LocalWorkSize, 0, NULL, NULL);
clEnqueueReadBuffer(hCmdQueue[Plat-1][Dev-1],DevRE, CL_TRUE, 0, sizeof(unsigned long)*512,(void*) RE , 0, NULL, NULL);

Today I plan to try to convert the following CUDA reduction example to OpenCL.

__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

    for(unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if(tid == 0) g_odata[blockIdx.x] = sdata[0];
}

There is also a more optimized version (completely unrolled, with multiple elements per thread):

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

Is this possible using OpenCL?

Grizzly gave me this advice the other day:

“…use a reduction kernel which operates on n element and reduces them to something like n / 16 (or any other number). Then you iteratively call that kernel until you are down to one element, which is your result”

I want to try this as well, but I don’t exactly know where to start, and I want to first just get something to work.

1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 6:55 am

    The first reduction code you gave should work as long as only one work-group is working on the reduction (so get_global_size(0) == get_local_size(0)). In that case the size argument of the kernel is the number of elements in A (which has no real correlation to either the global or the local work size). While that is a workable solution, it seems inherently wasteful to let most of the GPU idle while doing the reduction, which is precisely why I proposed iteratively calling a reduction kernel. This is possible with only slight modifications to the code:

    __kernel void sum(__global const short *A, __global unsigned long  *C, uint size, __local unsigned long *L) {
            // Each work-item sums a strided slice of A across the whole NDRange.
            unsigned long sum=0;
            for(uint i=get_global_id(0); i < size; i += get_global_size(0))
                    sum += A[i];
            L[get_local_id(0)]=sum;
    
            // Tree reduction in local memory; assumes the local size is a power of two.
            for(uint c=get_local_size(0)/2;c>0;c/=2)
            {
                    barrier(CLK_LOCAL_MEM_FENCE);
                    if(c>get_local_id(0))
                            L[get_local_id(0)]+=L[get_local_id(0)+c];
    
            }
            // One partial sum per work-group.
            if(get_local_id(0)==0)
                    C[get_group_id(0)]=L[0];
            barrier(CLK_LOCAL_MEM_FENCE);
    }
    

    Calling this with a GlobalWorkSize smaller than size (e.g. size/4) reduces the input in A by a factor of 4*LocalWorkSize, because each work-group writes a single element to C. This can be iterated by using the output buffer as the input for the next call to sum, with a different output buffer. Actually that isn’t quite true: the second (and every following) iteration needs A to be of type __global const unsigned long*, so you will actually need two kernels, but you get the idea.
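To make the iteration concrete, here is a minimal CPU-side sketch in plain C that models the kernel above, not actual OpenCL host code. The names sum_pass, reduce_all, MAX_LOCAL and ELEMS_PER_ITEM are made up for illustration, every pass uses unsigned 64-bit input (so the short-input first pass is not modelled), and the choice of roughly four elements per work-item is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_LOCAL = 1024 };     /* hypothetical cap on the modelled work-group size */
enum { ELEMS_PER_ITEM = 4 };   /* hypothetical: about 4 input elements per work-item */

/* Models one launch of the kernel with num_groups work-groups of local_size
   work-items. Writes one partial sum per group to out[0..num_groups-1]. */
static void sum_pass(const uint64_t *in, size_t n,
                     uint64_t *out, size_t num_groups, size_t local_size)
{
    uint64_t scratch[MAX_LOCAL];               /* stands in for __local L */
    size_t global_size = num_groups * local_size;

    assert(local_size > 0 && local_size <= MAX_LOCAL);
    assert((local_size & (local_size - 1)) == 0);  /* tree step needs a power of two */

    for (size_t g = 0; g < num_groups; ++g) {
        /* Phase 1: each work-item sums a strided slice of the input. */
        for (size_t lid = 0; lid < local_size; ++lid) {
            uint64_t s = 0;
            for (size_t i = g * local_size + lid; i < n; i += global_size)
                s += in[i];
            scratch[lid] = s;
        }
        /* Phase 2: tree reduction; reads [c, 2c) and writes [0, c), so running
           the work-items one after another gives the same result as in parallel. */
        for (size_t c = local_size / 2; c > 0; c /= 2)
            for (size_t lid = 0; lid < c; ++lid)
                scratch[lid] += scratch[lid + c];
        out[g] = scratch[0];
    }
}

/* Repeats passes, swapping the two buffers, until one value is left.
   Both buffers are overwritten; buf_b needs room for the first pass's output. */
static uint64_t reduce_all(uint64_t *buf_a, uint64_t *buf_b,
                           size_t n, size_t local_size)
{
    uint64_t *in = buf_a, *out = buf_b;
    size_t per_group = ELEMS_PER_ITEM * local_size;

    assert(n > 0);
    while (n > 1) {
        size_t groups = (n + per_group - 1) / per_group;  /* ceil(n / per_group) */
        sum_pass(in, n, out, groups, local_size);
        n = groups;                        /* one partial sum per work-group */
        uint64_t *tmp = in; in = out; out = tmp;
    }
    return in[0];
}
```

For example, with 1000 elements and a local size of 64, the first pass launches 4 groups and leaves 4 partial sums, and the second pass launches a single group and leaves the result. On a real device each loop iteration would be one clEnqueueNDRangeKernel call with the two buffers swapped as kernel arguments.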

    Concerning the CUDA reduction sample: why would you bother converting it? It works basically exactly like the OpenCL version I posted above, except that it reduces by a hardcoded factor per iteration (2*LocalWorkSize instead of size/GlobalWorkSize*LocalWorkSize).
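As a rough worked example of those two reduction factors, the small C helper below counts how many passes it takes to get down to one element. The function names are hypothetical, and it assumes every pass shrinks the input by the same factor (rounded up), which is a simplification for the OpenCL variant, where the factor depends on the sizes chosen for each launch.

```c
#include <assert.h>
#include <stddef.h>

/* Elements left after one pass that shrinks the input by `factor`, rounded up. */
static size_t after_pass(size_t n, size_t factor)
{
    return (n + factor - 1) / factor;
}

/* Number of passes until a single element remains. */
static unsigned passes_needed(size_t n, size_t factor)
{
    unsigned passes = 0;
    assert(factor > 1);   /* otherwise the input never shrinks */
    while (n > 1) {
        n = after_pass(n, factor);
        ++passes;
    }
    return passes;
}
```

With 2^20 elements and a local size of 256, the hardcoded factor 2*256 = 512 gives 2^20 -> 2048 -> 4 -> 1, so passes_needed((size_t)1 << 20, 512) is 3. With a global size of 4096, size/GlobalWorkSize*LocalWorkSize is 256*256 = 65536, which leaves 16 partial sums after the first pass and finishes in the second.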

    Personally I use practically the same approach for the reduction, although I have split the kernel in two parts and only use the path using local memory for the last iteration:

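    __kernel void reduction_step(__global const unsigned long* A, __global unsigned long  * C, uint size) {
            // Each work-item sums its own strided slice of A; no local memory or barriers needed.
            unsigned long sum=0;
            for(uint i=get_global_id(0); i < size; i += get_global_size(0))
                    sum += A[i];
            // One partial sum per work-item, so C must hold get_global_size(0) elements.
            C[get_global_id(0)]= sum;
    }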
    

    For the final step, the full version, which does the reduction inside the work-group, is used. Of course you would need a second version of reduction_step taking __global const short*, and this code is an untested adaptation of your code (regrettably, I can’t post my own version). The advantages of this approach are the much lower complexity of the kernel doing most of the work and less wasted work due to divergent branches, which made it a bit faster than the other variant. However, I have no results for either the newest compiler version or the newest hardware, so that point might not hold anymore (though I suspect it still does, given the reduced number of divergent branches).
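The split can be sketched as a CPU model in plain C. This is only an illustration of the data flow, not OpenCL host code: reduction_step_model, final_sum_model, two_stage_sum and MAX_LOCAL are hypothetical names, and all data is treated as unsigned 64-bit, so the short-input variant is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_LOCAL = 1024 };  /* hypothetical cap on the modelled work-group size */

/* Stage 1: models reduction_step. Work-item gid sums in[gid], in[gid + G], ...
   and writes its own partial sum, so out receives global_size values. */
static void reduction_step_model(const uint64_t *in, size_t n,
                                 uint64_t *out, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid) {
        uint64_t s = 0;
        for (size_t i = gid; i < n; i += global_size)
            s += in[i];
        out[gid] = s;
    }
}

/* Stage 2: models the full kernel run with a single work-group, i.e. a strided
   sum per work-item followed by a tree reduction through local memory. */
static uint64_t final_sum_model(const uint64_t *in, size_t n, size_t local_size)
{
    uint64_t scratch[MAX_LOCAL];               /* stands in for __local L */

    assert(local_size > 0 && local_size <= MAX_LOCAL);
    assert((local_size & (local_size - 1)) == 0);  /* tree step needs a power of two */

    for (size_t lid = 0; lid < local_size; ++lid) {
        uint64_t s = 0;
        for (size_t i = lid; i < n; i += local_size)
            s += in[i];
        scratch[lid] = s;
    }
    for (size_t c = local_size / 2; c > 0; c /= 2)
        for (size_t lid = 0; lid < c; ++lid)
            scratch[lid] += scratch[lid + c];
    return scratch[0];
}

/* Runs both stages; `partial` must have room for global_size elements. */
static uint64_t two_stage_sum(const uint64_t *in, size_t n, uint64_t *partial,
                              size_t global_size, size_t local_size)
{
    reduction_step_model(in, n, partial, global_size);
    return final_sum_model(partial, global_size, local_size);
}
```

Calling two_stage_sum(a, 1000, partial, 16, 8) first shrinks 1000 elements to 16 partial sums without any barriers, then lets one work-group of 8 finish the job. For large inputs the first stage could be launched more than once before the final step.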

    Now for the paper you linked: it is certainly possible to use the optimizations suggested there in OpenCL, except for the use of templates, which OpenCL does not support, so the block sizes would have to be hardcoded. Of course the OpenCL version already does multiple adds per work-item and, if you follow the approach I mentioned above, would not really benefit from unrolling the reduction through local memory, since that is only done in the last step, which shouldn’t take a significant part of the whole calculation time for a big enough input.

    Furthermore, I find the lack of synchronization in the unrolled implementation a bit troublesome. It only works because all threads entering that part belong to the same warp. This isn’t necessarily true when executing on hardware other than current NVIDIA cards (future NVIDIA cards, AMD cards and CPUs; I think it should work on current AMD cards and current CPU implementations, but I wouldn’t count on it). So I would stay away from it unless I needed the absolute last bit of speed for the reduction, and even then I would still provide a generic version and switch to it if I don’t recognize the hardware.
