Editorial Team
Asked: May 23, 2026

I am a fairly experienced OpenMP user, but I have just run into a puzzling problem, and I am hopeful that someone here could help. The problem is that a simple hashing algorithm performs well for stack-allocated arrays, but poorly for arrays on the heap.

The example below uses i%M (i modulo M) to count every M-th integer in the corresponding array element. For simplicity, imagine N=1000000, M=10. If N%M==0, then every element of bins[] should end up equal to N/M:

#pragma omp for
  for (int i=0; i<N; i++) 
    bins[ i%M ]++;

Array bins[] is private to each thread (I sum results of all threads in a critical section afterwards).

When bins[] is allocated on the stack, the program works great, with performance scaling proportionally to the number of cores.

However, if bins[] is on the heap (pointer to bins[] is on the stack), performance drops drastically. And that is a major problem!

I want to parallelize binning (hashing) of certain data into heap arrays with OpenMP, and this is a major performance hit.

It is definitely not something silly like all threads trying to write into the same area of memory.
That is because each thread has its own bins[] array, results are correct with both heap- and stack-allocated bins, and there is no difference in performance for single-thread runs.
I reproduced the problem on different hardware (Intel Xeon and AMD Opteron), with GCC and Intel C++ compilers. All tests were on Linux (Ubuntu and RedHat).

There seems to be no reason why good OpenMP performance should be limited to stack arrays.

Any guesses? Maybe access of threads to the heap goes through some kind of shared gateway on Linux? How do I fix that?

Complete program to play around with is below:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(const int argc, const char* argv[])
{
  const int N=1024*1024*1024;
  const int M=4;
  double t1, t2;
  int checksum=0;

  printf("OpenMP threads: %d\n", omp_get_max_threads());

  //////////////////////////////////////////////////////////////////
  // Case 1: stack-allocated array
  t1=omp_get_wtime();
  checksum=0;
#pragma omp parallel
  { // Each openmp thread should have a private copy of 
    // bins_thread_stack on the stack:
    int bins_thread_stack[M];
    for (int j=0; j<M; j++) bins_thread_stack[j]=0;
#pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_stack[j]++;
      }
#pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_stack[j];
  }
  t2=omp_get_wtime();
  printf("Time with stack array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
  //////////////////////////////////////////////////////////////////

  //////////////////////////////////////////////////////////////////
  // Case 2: heap-allocated array
  t1=omp_get_wtime();
  checksum=0;
  #pragma omp parallel 
  { // Each openmp thread should have a private copy of 
    // bins_thread_heap on the heap:
    int* bins_thread_heap=(int*)malloc(sizeof(int)*M); 
    for (int j=0; j<M; j++) bins_thread_heap[j]=0;
  #pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_heap[j]++;
      }
  #pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_heap[j];
    free(bins_thread_heap);
  }
  t2=omp_get_wtime();
  printf("Time with heap  array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
  //////////////////////////////////////////////////////////////////

  return 0;
}

Sample outputs of the program are below:

for OMP_NUM_THREADS=1

OpenMP threads: 1
Time with stack array: 2.973 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 3.091 sec, checksum=1073741824 (must be 1073741824).

and for OMP_NUM_THREADS=10

OpenMP threads: 10
Time with stack array: 0.329 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 2.150 sec, checksum=1073741824 (must be 1073741824).

I would very much appreciate any help!

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 4:42 pm

    This is a cute problem: with the code as above (gcc 4.4, Intel i7) and 4 threads I get

    OpenMP threads: 4
    Time with stack array:        1.696 sec, checksum=1073741824 (must be 1073741824).
    Time with heap  array:        5.413 sec, checksum=1073741824 (must be 1073741824).
    

    but if I change the malloc line to

        int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
    

    (Update: or even

        int* bins_thread_heap=(int*)malloc(sizeof(int)*16);
    

    )

    then I get

    OpenMP threads: 4
    Time with stack array:        1.578 sec, checksum=1073741824 (must be 1073741824).
    Time with heap  array:        1.574 sec, checksum=1073741824 (must be 1073741824).
    

    The problem here is false sharing. The default malloc is very space-efficient and packs the small requested allocations next to each other in one block of memory. Because the allocations are so small that several fit in the same cache line, every time one thread updates its values it dirties the cache line that holds the neighbouring threads' values. Making each allocation larger (16 ints is 64 bytes with 4-byte ints, a common cache-line size) pushes the bins that each thread actually touches far enough apart that this is no longer an issue.
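Rather than relying on an oversized malloc, the per-thread arrays can be placed on cache-line boundaries explicitly. The following is a minimal sketch, not the code from the question: it assumes a 64-byte cache line (the `CACHE_LINE` constant is a guess to verify for your CPU) and a C11 library providing `aligned_alloc`. The helper names `round_up_to_line`, `alloc_private_bins` and `count_all` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines; verify for your CPU */

/* Round a byte count up to a whole number of cache lines. */
size_t round_up_to_line(size_t bytes) {
    return (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Allocate m (m > 0) zeroed int bins that start on a cache-line boundary and
 * occupy whole cache lines, so two such blocks never share a line. */
int *alloc_private_bins(size_t m) {
    size_t bytes = round_up_to_line(m * sizeof(int));
    int *bins = aligned_alloc(CACHE_LINE, bytes);  /* C11 */
    if (bins != NULL)
        memset(bins, 0, bytes);
    return bins;
}

/* Same counting loop as in the question, with padded per-thread bins. */
long count_all(int n, int m) {
    long checksum = 0;
#pragma omp parallel
    {
        int *bins = alloc_private_bins((size_t)m);
        assert(bins != NULL);
#pragma omp for
        for (int i = 0; i < n; i++)
            bins[i % m]++;
#pragma omp critical
        for (int j = 0; j < m; j++)
            checksum += bins[j];
        free(bins);
    }
    return checksum;
}
```

Calling `count_all(N, M)` should return N, as the checksum in the question does. Built without OpenMP support the pragmas are ignored and the function simply runs serially.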

    Incidentally, it should be clear why the stack-allocated case does not see this problem: different threads have different stacks, so their memory is far enough apart that false sharing isn't an issue.
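Whether two addresses are "far enough apart" can be checked directly. This is a small diagnostic sketch, again assuming 64-byte cache lines; `same_cache_line` and `report_small_allocations` are hypothetical helpers. Where malloc places its blocks is implementation-specific, so the printed result will vary between allocators and runs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines */

/* Nonzero if both addresses fall inside the same CACHE_LINE-sized block. */
int same_cache_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}

/* Print where two consecutive small malloc() blocks land in memory. */
void report_small_allocations(void) {
    int *a = malloc(4 * sizeof(int));
    int *b = malloc(4 * sizeof(int));
    assert(a != NULL && b != NULL);
    printf("a=%p b=%p share a line: %d\n",
           (void *)a, (void *)b, same_cache_line(a, b));
    free(a);
    free(b);
}
```

The same check can be applied to the `bins_thread_heap` pointers of different threads, or to the addresses of their stack arrays, to compare the two cases.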

    As a side point: it doesn't really matter for M of the size you're using here, but if your M (or number of threads) were larger, the omp critical would become a big serial bottleneck. You can use an OpenMP reduction to sum the checksum more efficiently:

    #pragma omp parallel reduction(+:checksum)
        { // Each openmp thread should have a private copy of 
            // bins_thread_heap on the heap:
            int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
            for (int j=0; j<M; j++) bins_thread_heap[j]=0;
    #pragma omp for
            for (int i=0; i<N; i++)
            { // Accumulating every M-th number in respective array element
                const int j=i%M;
                bins_thread_heap[j]++;
            }
            for (int j=0; j<M; j++)
                checksum+=bins_thread_heap[j];
            free(bins_thread_heap);
     }
    