The Archive Base

Editorial Team
Asked: June 6, 2026


This is part of a parallel reduction/extremum kernel. I have reduced it to the minimum code that still makes clBuildProgram crash (note that it really crashes; it doesn’t just return an error code):

EDIT: It seems like this also happens when local_value is declared global instead of local.

EDIT2 / SOLUTION: The problem was an infinite loop. I should have written remaining_items >>= 1 instead of remaining_items >> 1. As the answer points out, the NVIDIA compiler does not seem very robust when it comes to compile/optimization errors.

kernel void testkernel(local float *local_value)
{
    size_t thread_id = get_local_id(0);

    int remaining_items = 1024;

    while (remaining_items > 1)
    {
        // throw away the right half of the threads
        remaining_items >> 1; // <-- SPOTTED THE BUG: the result is discarded, should be remaining_items >>= 1
        if (thread_id > remaining_items)
        {
            return;
        }

        // look for a greater value in the right half of the memory space
        int right_index = thread_id + remaining_items;
        float right_value = local_value[right_index];
        if (right_value > local_value[thread_id])
        {
            local_value[thread_id] = right_value;
        }

        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

Removing the lines return; and/or local_value[thread_id] = right_value; causes clBuildProgram to finish successfully.

I can reproduce this problem on all of my computers (NVIDIA GTX 560, GT 555M and GT 540M, all Fermi GPUs with compute capability 2.1). It occurs with NVIDIA CUDA Toolkit SDK versions 4.0, 4.1 and 4.2, using either the x64 or the x86 libraries.

Does anyone have an idea what could be the problem?

Is it possible that local (aka shared) memory is automatically assumed to be WORK_GROUP_SIZE * sizeof(its_base_type)? That would explain why it works when the lines mentioned above are removed.


Minimal host code (C99 compatible) for reproduction:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define RETURN_THROW(expression) do { cl_int ret = expression; if (ret) { printf(#expression " FAILED: %d\n" , ret); exit(1); } } while (0)
#define REF_THROW(expression) do { cl_int ret; expression; if (ret) { printf(#expression " FAILED: %d\n" , ret); exit(1); } } while (0)

int main(int argc, char **argv)
{
    // Load the kernel source code into the array source_str
    FILE *fp;

    fp = fopen("testkernel.cl", "rb");
    if (!fp)
    {
        fprintf(stderr, "Failed to load kernel.\n");
        exit(1);
    }
    fseek(fp, 0, SEEK_END);
    long filesize = ftell(fp);
    rewind(fp);
    // one extra byte for the terminating NUL written below
    char *source_str = (char*)calloc(filesize + 1, sizeof(char));
    size_t bytes_read = fread(source_str, 1, filesize, fp);
    source_str[bytes_read] = 0;
    fclose(fp);

    // Get platform information
    cl_uint num_platforms;
    RETURN_THROW(clGetPlatformIDs(0, NULL, &num_platforms));

    cl_platform_id *platform_ids = (cl_platform_id *)calloc(num_platforms, sizeof(cl_platform_id));
    RETURN_THROW(clGetPlatformIDs(num_platforms, platform_ids, NULL));

    cl_device_id selected_device_id = NULL;

    printf("available platforms:\n");
    for (cl_uint i = 0; i < num_platforms; i++)
    {
        char platform_name[50];
        RETURN_THROW(clGetPlatformInfo(platform_ids[i], CL_PLATFORM_NAME, 50, platform_name, NULL));
        printf("%s\n", platform_name);

        // get devices for this platform
        cl_uint num_devices;
        RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices));

        cl_device_id *device_ids = (cl_device_id *)calloc(num_devices, sizeof(cl_device_id));
        RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, num_devices, device_ids, NULL));

        // select the first device of the first NVIDIA platform
        if (selected_device_id == NULL && strstr(platform_name, "NVIDIA"))        // ADAPT THIS ACCORDINGLY
        {
            selected_device_id = device_ids[0];
        }
        free(device_ids);
    }

    if (selected_device_id == NULL)
    {
        printf("No NVIDIA device found\n");
        exit(1);
    }

    // Create an OpenCL context
    cl_context context;
    REF_THROW(context = clCreateContext(NULL, 1, &selected_device_id, NULL, NULL, &ret));

    // Create a program from the kernel source
    cl_program program;
    REF_THROW(program = clCreateProgramWithSource(context, 1, (const char **)&source_str, NULL, &ret));

    // Build the program
    cl_int ret = clBuildProgram(program, 1, &selected_device_id, NULL, NULL, NULL);
    if (ret)
    {
        printf("BUILD ERROR\n");
        // build error - get build log and display it
        size_t build_log_size;
        ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &build_log_size);
        char *build_log = (char *)malloc(build_log_size);  // malloc instead of C++ new, so the file stays C99
        ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, build_log_size, build_log, NULL);
        printf("%s\n", build_log);
        free(build_log);
        exit(1);
    }

    printf("build finished successfully\n");
    return 0;
}
1 Answer

    Editorial Team
    Answered: June 6, 2026 at 9:45 pm

    In my experience the NVIDIA compiler isn’t very robust when it comes to handling build errors, so you probably have a compile error somewhere.

    I think your problem is indeed the return, or more precisely its combination with barrier. According to the OpenCL specification on barriers:

    All work-items in a work-group executing the kernel on a processor
    must execute this function before any are allowed to continue
    execution beyond the barrier. This function must be encountered by all
    work-items in a work-group executing the kernel.

    If barrier is inside a conditional statement, then all work-items must enter the
    conditional if any work-item enters the conditional statement and
    executes the barrier.

    If barrier is inside a loop, all work-items
    must execute the barrier for each iteration of the loop before any are
    allowed to continue execution beyond the barrier.

    So I think your problem is probably that many work-items return before reaching the barrier, which makes this code invalid. Maybe you should try something like this:

    kernel void testkernel(local float *local_value) {
        size_t thread_id = get_local_id(0);
        size_t remaining_items = 1024;
        while (remaining_items > 1) {
            remaining_items >>= 1; // throw away the right half of the threads
            // strict <: with <=, work-item remaining_items would read local_value[2 * remaining_items]
            if (thread_id < remaining_items) {
                 // look for a greater value in the right half of the memory space
                 size_t right_index = thread_id + remaining_items;
                 float right_value = local_value[right_index];
                 if (right_value > local_value[thread_id])
                     local_value[thread_id] = right_value;
            }
            // reached by every work-item on every iteration; local_value lives in
            // local memory, so the local memory fence is the one that matters
            barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
    

    Edit: Furthermore, as noted in the comments, it needs to be remaining_items >>= 1 instead of remaining_items >> 1 to avoid an infinite loop.



Related Questions

This is a part of my code which does some sort of animation; however,
This is a 2 part question. 1) Is there some sort of program that
This is part of some instructions that I was given from a website helping
This question has two parts. Part 1. Yesterday I had some code which would
This part of an app that I am working on, I have the following
I have a list of items that are in this format: 05/01 – Some
This may be a stupid mistake of some sort on my part, but I'm
For some reasons this part where I fetch JSON data from following url will
I was reading this question, but didn't understand some part of the aix's answer
This is really a two part question: I'm seeing some users in the Grantee



© 2021 The Archive Base. All Rights Reserved