Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8845663
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 14, 20262026-06-14T11:47:29+00:00 2026-06-14T11:47:29+00:00

I have some questions hanging in the air without an answer for a few

  • 0

I have some questions hanging in the air without an answer for a few days now. The questions arose because I have an OpenMP and an OpenCL implementations of the same problem. The OpenCL runs perfectly on GPU but has 50% less performance when ran on CPU (compared to the OpenMP implementation). A post is already dealing with the difference between OpenMP and OpenCL performances, but it doesn’t answer my questions. At the moment I face these questions:

1) Is it really that important to have “vectorized kernel” (in terms of the Intel Offline Compiler)?

There is a similar post, but I think my question is more general.

As I understand: a vectorized kernel not necessarily means that there is no vector/SIMD instruction in the compiled binary. I checked the assembly codes of my kernels, and there are a bunch of SIMD instructions. A vectorized kernel means that by using SIMD instructions you can execute 4 (SSE) or 8 (AVX) OpenCL “logical” threads in one CPU thread. This can only be achieved if ALL your data is consecutively stored in the memory. But who has such perfectly sorted data?

So my question would be: Is it really that important to have your kernel “vectorized” in this sense?

Of course it gives performance improvement, but if most of the computation intensive parts in the kernel are done by vector instructions then you might get near the “optimal” performance. I think the answer to my question lies in the memory bandwidth. Probably vector registers better fit to efficient memory access. In that case the kernel arguments (pointers) have to be vectorized.

2) If I allocate data in local memory on a CPU, where will it be allocated? OpenCL shows L1 cache as local memory, but it is clearly not the same type of memory like on GPU’s local memory. If its stored in the RAM/global memory, then there is no sense copying data into it. If it would be in cache, some other process would might flush it out… so that doesn’t make sense either.

3) How are “logical” OpenCL threads mapped to real CPU software/hardware(Intel HTT) threads? Because if I have short running kernels and the kernels are forked like in TBB (Thread Building Blocks) or OpenMP then the fork overhead will dominate.

4) What is the thread fork overhead? Are there new CPU threads forked for every “logical” OpenCL threads or are the CPU threads forked once, and reused for more “logical” OpenCL threads?

I hope I’m not the only one who is interested in these tiny things and some of you might now bits of these problems. Thank you in advance!


UPDATE

3) At the moment OpenCL overhead is more significant then OpenMP, so heavy kernels are required for efficient runtime execution. In Intel OpenCL a work-group is mapped to an TBB thread, so 1 virtual CPU core executes a whole work-group (or thread block). A work-group is implemented with 3 nested for loops, where the inner most loop is vectorized, if possible. So you could imagine it something like:

#pragam omp parallel for
for(wg=0; wg < get_num_groups(2)*get_num_groups(1)*get_num_groups(0); wg++) {

  for(k=0; k<get_local_size(2); k++) {
    for(j=0; j<get_local_size(1); j++) {
      #pragma simd
      for(i=0; i<get_local_size(0); i++) {
        ... work-load...
      }
    }
  }
}

If the inner most loop can be vectorized it steps with SIMD steps:

for(i=0; i<get_local_size(0); i+=SIMD) {

4) Every TBB thread is forked once during the OpenCL execution and they are reused. Every TBB thread is tied to a virtual core, ie. there is no thread migration during the computation.

I also accept @natchouf-s answer.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-14T11:47:30+00:00Added an answer on June 14, 2026 at 11:47 am

    I may have a few hints to your questions. In my little experience, a good OpenCL implementation tuned for the CPU can’t beat a good OpenMP implementation. If it does, you could probably improve the OpenMP code to beat the OpenCL one.

    1) It is very important to have vectorized kernels. It is linked to your question number 3 and 4. If you have a kernel that handles 4 or 8 input values, you’ll have much less work items (threads), and hence much less overhead. I recommend to use the vector instructions and data provided by OpenCL (like float4, float8, float16) instead of relying on auto-vectorization.
    Do not hesitate to use float16 (or double16): this will be mapped to 4 sse or 2 avx vectors and will divide by 16 the number of work items required (which is good for CPU, but not always for GPU: I use 2 different kernels for CPU and GPU).

    2) local memory on CPU is the RAM. Don’t use it on a CPU kernel.

    3 and 4) I don’t really know, it will depend on the implementation, but the fork overhead seems important to me.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have some questions concering routing with Codeigniter. What I´m doing now is the
I have some questions regarding the following css that I found: html, body {
I have some questions about importing data from Excel/CSV File into SQL Server. Let
I have some questions with relationship of ef code first. My code: public class
I have some questions, maybe stupid question. I have this url: http://flibusta.net/opds/opensearch?searchTerm=Тол&searchType=books and I
i have some questions about constructors in ColdFusion : must i use the name
I have some questions about using MySQLi queries, and related memory management. Suppose I
I have some questions about the default values in a function parameter list Is
I have some questions which are as follows: How can I use the JSP
I have some questions about vector in STL to clarify..... Where are the objects

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.