The Archive Base


Editorial Team
Asked: May 30, 2026 (2026-05-30T03:36:43+00:00)

I need some clarification. I’m developing OpenCL on my laptop running a small nvidia


I need some clarification. I’m developing OpenCL code on my laptop, which has a small NVIDIA GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS, the result is 2. I read that the number of work groups used to run a kernel should correspond to the number of compute units (Heterogeneous Computing with OpenCL, Chapter 9, p. 186); otherwise too much global memory bandwidth would be wasted.

The chip is also specified as having 16 CUDA cores (which I believe correspond to PEs). Does that mean that, theoretically, the most performant setup for this GPU with regard to global memory bandwidth is two work groups of 16 work items each?

1 Answer

  1. Editorial Team
    Added an answer on May 30, 2026 at 3:36 am

    While setting the number of work groups to be equal to CL_DEVICE_MAX_COMPUTE_UNITS might be sound advice on some hardware, it certainly is not on NVIDIA GPUs.

    On the CUDA architecture, an OpenCL compute unit is the equivalent of a multiprocessor (which can have either 8, 32 or 48 cores at the time of writing), and these are designed to be able to simultaneously run up to 8 work groups (blocks in CUDA) each. At larger input data sizes, you might choose to run thousands of work groups, and your particular GPU can handle up to 65535 x 65535 work groups per kernel launch.
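To make the sizing concrete, here is a minimal sketch of how a launch might be laid out once the input is large. It is plain arithmetic with no OpenCL dependency; the helper name `launch_geometry`, the input size of 1,000,000 items and the work-group size of 256 are illustrative assumptions, not values from the question. The figure of 8 simultaneous work groups per compute unit is the one quoted above and is only an upper bound, since register and local memory usage can lower it.

```python
def launch_geometry(n_items, local_size):
    """Return (num_work_groups, global_size) covering n_items work items.

    The global size is rounded up to a whole number of work groups, as
    OpenCL 1.x requires the global size to be a multiple of the local size.
    """
    if n_items <= 0 or local_size <= 0:
        raise ValueError("n_items and local_size must be positive")
    num_groups = (n_items + local_size - 1) // local_size  # ceiling division
    return num_groups, num_groups * local_size


# Figures taken from the question and the answer above.
COMPUTE_UNITS = 2        # CL_DEVICE_MAX_COMPUTE_UNITS reported by the asker
MAX_GROUPS_PER_CU = 8    # simultaneous work groups per multiprocessor (upper bound)

# Hypothetical workload: one million items, 256 work items per group.
groups, global_size = launch_geometry(1_000_000, 256)

# At most this many groups can be resident at once; the rest are queued
# by the hardware and scheduled as earlier groups finish.
resident_upper_bound = min(groups, COMPUTE_UNITS * MAX_GROUPS_PER_CU)

print(groups, global_size, resident_upper_bound)
```

Because the global size is rounded up, a kernel launched this way would normally guard its body with a check such as `get_global_id(0) < n` so the padding work items do nothing.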

    OpenCL also provides a per-kernel query, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, which is obtained through clGetKernelWorkGroupInfo rather than as a device attribute. If you query it on an NVIDIA device, it will return 32 (this is the "warp", or natural SIMD width of the hardware). That value is the work group size multiple you should use; work group sizes can be up to 512 items each, depending on the resources consumed by each work item. The standard rule of thumb for your particular GPU is that you need at least 192 active work items per compute unit (threads per multiprocessor in CUDA terms) to cover all the latency of the architecture and potentially obtain either full memory bandwidth or full arithmetic throughput, depending on the nature of your code.
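As a rough check of the rule of thumb, the sketch below compares a launch configuration against the 192-work-items-per-compute-unit target. The constants are the figures quoted above; the function names are hypothetical, and the check assumes work groups are spread evenly across compute units, which is a simplification.

```python
WARP = 32                 # preferred work-group size multiple quoted above
MIN_ACTIVE_PER_CU = 192   # rule-of-thumb active work items per compute unit
MAX_GROUP_SIZE = 512      # largest work-group size quoted above


def groups_needed_per_cu(local_size):
    """Work groups one compute unit needs to reach the 192-item target."""
    if not 0 < local_size <= MAX_GROUP_SIZE:
        raise ValueError("local_size must be between 1 and 512")
    if local_size % WARP != 0:
        raise ValueError("local_size should be a multiple of 32")
    return -(-MIN_ACTIVE_PER_CU // local_size)  # ceiling division


def meets_rule_of_thumb(num_groups, local_size, compute_units):
    """True if the launch supplies at least 192 work items per compute unit.

    Assumes groups are distributed evenly over the compute units.
    """
    return num_groups * local_size >= MIN_ACTIVE_PER_CU * compute_units


# The configuration proposed in the question: 2 groups of 16 work items.
print(meets_rule_of_thumb(2, 16, compute_units=2))

# A configuration sized from the rule of thumb: 64 items per group.
per_cu = groups_needed_per_cu(64)
print(per_cu, meets_rule_of_thumb(per_cu * 2, 64, compute_units=2))
```

Under these assumptions, two work groups of 16 items supply only 32 work items against a target of 384 for two compute units, which is why the configuration in the question would leave the GPU largely idle.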

    NVIDIA ship a good document called "OpenCL Programming Guide for the CUDA Architecture" in the CUDA toolkit. You should take some time to read it, because it contains all the specifics of how the NVIDIA OpenCL implementation maps onto the features of their hardware, and it will answer the questions you have raised here.


© 2021 The Archive Base. All Rights Reserved