Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8193853
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 7, 20262026-06-07T04:37:25+00:00 2026-06-07T04:37:25+00:00

I implemented a parallel merge sort algorithm from Cormen’s well-known text. I wrote it

  • 0

I implemented a parallel merge sort algorithm from Cormen’s well-known text. I wrote it in C using pthreads, and compiled with MinGW on Win7 x64 (also tested later with GCC in Ubuntu with same results). My first approach at the parallelization was naïve… I spawned a new thread at every recursion level (which is actually what Cormen’s pseudocode implies). However this usually ends up either taking way too long or crashing due to segmentation fault (I can assume there is some hard limit to how many threads the system can handle). This seems to be a common newbie mistake for recursive parallelization, in fact I found a similar DISCUSSION on this site. So I instead used the recommendation in that thread, namely setting a threshold for problem size, and if the function that spawns new threads is given a set smaller than the threshold (say 10,000 elements) then it just operates on the elements directly, rather than creating a new thread for such a small set.

Now everything seemed to be working fine. I tabulated some of my results below. N is problem size (a set of integers [1, 2, 3, …, N] thoroughly scrambled) and threshold is the value below which my parallel sort and parallel merge functions refuse to spawn new threads. The first table shows sort time in ms, the second shows how many sort/merge worker threads were spawned in each case. Looking at the N=1E6 and N=1E7 rows in the bottom table, you can see that anytime I lower the threshold such that more than ~8000 merge workers are allowed, I get segmentation fault. Again, I assume that is due to some limit the system gives on threads, and I’d be happy to hear more about that, but it’s not my main question.

The main question, is why does the final row get segfault when trying to use a fairly high threshold, which would have spawned an expected 15/33 worker threads (following pattern from previous rows). Surely this is not too many threads for my system to handle. The one instance which did complete (lower right cell in table) used about 1.2GB RAM (my system has 6GB), and the threaded versions never seem to take more RAM compared to the ones with 0 threads at the right of each row.

  • I don’t think I am hitting any sort of heap limit… tons of RAM available and it should only take ~1GB even if it was allowed to spawn the 15/33 threads.
  • I also don’t think it is a stack problem. I designed the program to use minimal stack, and I don’t think the footprint each thread would be related to problem size N at all, only the heap. I’m pretty inexperienced with this… but I did a core dump stack backtrace in gdb and the addresses from top to bottom of stack seem close enough to rule out overflow there.
  • I tried reading the return values of pthread_create… in Windows I got a value of 11 a few times before the crash (but it didn’t seem to trigger the crash, since there were a few 11’s, then a few 0’s, i.e. no error, then another 11). That error code is EAGAIN, resources unavailable… but I am not sure what it really means here. Moreover, in Ubuntu the error code was 0 every time even right up to the crash.
  • I tried Valgrind and got a lot of messages about memory leaks, but I am not sure those are legit since I know Valgrind requires extra resources, and I was able to get those types of errors on other problem set sizes that worked fine without Valgrind.

It’s pretty obvious it’s related to problem size and system resources… I’m hoping there’s some piece of general knowledge I’m missing that makes the answer really clear.

Any ideas? Sorry for the long wall of text… thanks if you’ve read this far! I can post the source if it seems relevant.

enter image description here

EDIT: Source added for reference:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <pthread.h>

const int               N = 100000000;
const int  SORT_THRESHOLD = 10000000;
const int MERGE_THRESHOLD = 10000000;

int  sort_thread_count = 0;
int merge_thread_count = 0;

typedef struct s_pmergesort_args {
    int *vals_in, p, r, *vals_out, s;
} pmergesort_args;

typedef struct s_pmerge_args {
    int *temp, p1, r1, p2, r2, *vals_out, p3;
} pmerge_args;

void *p_merge_sort(void *v_pmsa);
void *p_merge(void *v_pma);
int binary_search(int val, int *temp, int p, int r);

int main() {
    int *values, i, rand1, rand2, temp, *sorted;
    long long rand1a, rand1b, rand2a, rand2b;
    struct timeval start, end;

    /* allocate values on heap and initialize */
    values = malloc(N * sizeof(int));
    sorted = malloc(N * sizeof(int));
    for (i = 0; i < N; i++) {
        values[i] = i + 1;
        sorted[i] = 0;
    }

    /* scramble
     *  - complicated logic to maximize swapping
     *  - lots of testing (not shown) was done to verify optimal swapping */
    srand(time(NULL));
    for (i = 0; i < N/10; i++) {
        rand1a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1 = (int)((rand1a * rand1b + rand()) % N);
        rand2a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2 = (int)((rand2a * rand2b + rand()) % N);
        temp = values[rand1];
        values[rand1] = values[rand2];
        values[rand2] = temp;
    }

    /* set up args for p_merge_sort */
    pmergesort_args pmsa;
    pmsa.vals_in = values;
    pmsa.p = 0;
    pmsa.r = N-1;
    pmsa.vals_out = sorted;
    pmsa.s = 0;

    /* sort */
    gettimeofday(&start, NULL);
    p_merge_sort(&pmsa);
    gettimeofday(&end, NULL);

    /* verify sorting */
    for (i = 1; i < N; i++) {
        if (sorted[i] < sorted[i-1]) {
            fprintf(stderr, "Error: array is not sorted.\n");
            exit(0);
        }
    }
    printf("Success: array is sorted.\n");
    printf("Sorting took %dms.\n", (int)(((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec))/1000));

    free(values);
    free(sorted);

    printf("(  sort threads created: %d )\n", sort_thread_count);
    printf("( merge threads created: %d )\n", merge_thread_count);

    return 0;
}

void *p_merge_sort(void *v_pmsa) {
    pmergesort_args pmsa = *((pmergesort_args *) v_pmsa);
    int *vals_in = pmsa.vals_in;
    int p = pmsa.p;
    int r = pmsa.r;
    int *vals_out = pmsa.vals_out;
    int s = pmsa.s;

    int n = r - p + 1;
    pthread_t worker;

    if (n > SORT_THRESHOLD) {
        sort_thread_count++;
    }

    if (n == 1) {
        vals_out[s] = vals_in[p];
    } else {
        int *temp = malloc(n * sizeof(int));
        int q = (p + r) / 2;
        int q_ = q - p + 1;

        pmergesort_args pmsa_l;
        pmsa_l.vals_in = vals_in;
        pmsa_l.p = p;
        pmsa_l.r = q;
        pmsa_l.vals_out = temp;
        pmsa_l.s = 0;

        pmergesort_args pmsa_r;
        pmsa_r.vals_in = vals_in;
        pmsa_r.p = q+1;
        pmsa_r.r = r;
        pmsa_r.vals_out = temp;
        pmsa_r.s = q_;

        if (n > SORT_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge_sort, &pmsa_l);
        } else {
            p_merge_sort(&pmsa_l);
        }
        p_merge_sort(&pmsa_r);

        if (n > SORT_THRESHOLD) {
            pthread_join(worker, NULL);
        }

        pmerge_args pma;
        pma.temp = temp;
        pma.p1 = 0;
        pma.r1 = q_ - 1;
        pma.p2 = q_;
        pma.r2 = n - 1;
        pma.vals_out = vals_out;
        pma.p3 = s;
        p_merge(&pma);
        free(temp);
    }
}

void *p_merge(void *v_pma) {
    pmerge_args pma = *((pmerge_args *) v_pma);
    int *temp = pma.temp;
    int p1 = pma.p1;
    int r1 = pma.r1;
    int p2 = pma.p2;
    int r2 = pma.r2;
    int *vals_out = pma.vals_out;
    int p3 = pma.p3;

    int n1 = r1 - p1 + 1;
    int n2 = r2 - p2 + 1;
    int q1, q2, q3, t;
    pthread_t worker;

    if (n1 < n2) {
        t = p1; p1 = p2; p2 = t;
        t = r1; r1 = r2; r2 = t;
        t = n1; n1 = n2; n2 = t;
    }
    if (n1 > MERGE_THRESHOLD) {
        merge_thread_count++;
    }

    if (n1 == 0) {
        return;
    } else {

        q1 = (p1 + r1) / 2;
        q2 = binary_search(temp[q1], temp, p2, r2);
        q3 = p3 + (q1 - p1) + (q2 - p2);
        vals_out[q3] = temp[q1];

        pmerge_args pma_l;
        pma_l.temp = temp;
        pma_l.p1 = p1;
        pma_l.r1 = q1-1;
        pma_l.p2 = p2;
        pma_l.r2 = q2-1;
        pma_l.vals_out = vals_out;
        pma_l.p3 = p3;

        if (n1 > MERGE_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge, &pma_l);
        } else {        
            p_merge(&pma_l);
        }        

        pmerge_args pma_r;
        pma_r.temp = temp;
        pma_r.p1 = q1+1;
        pma_r.r1 = r1;
        pma_r.p2 = q2;
        pma_r.r2 = r2;
        pma_r.vals_out = vals_out;
        pma_r.p3 = q3+1;

        p_merge(&pma_r);

        if (n1 > MERGE_THRESHOLD) {
            pthread_join(worker, NULL);
        }
    }
}

int binary_search(int val, int *temp, int p, int r) {
    int low = p;
    int mid;
    int high = (p > r+1)? p : r+1;

    while (low < high) {
        mid = (low + high) / 2;
        if (val <= temp[mid]) {
            high = mid;
        } else {
            low = mid + 1;
        }
    }
    return high;
}

EDIT 2: Added new image below showing “max” and “total” RAM used by each version (max meaning highest simultaneous allocation/usage and total meaning the sum of all allocation requests through the program’s life). These suggest that with N=1E8 and threshold=1E7 I should get a max usage of 3.2GB (which my system should be able to support). But again… I guess it is related to some other limitation in the pthread library… not my actual system resources.

enter image description here

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-07T04:37:27+00:00Added an answer on June 7, 2026 at 4:37 am

    Looks like it is running out of memory. In your example, if the code is run sequentially, then the most memory it has allocated at one time is 1.6GB. When using threads, it is using more than 3GB. I put some wrappers around the malloc/free functions, and got this result:

    Allocation of 12500000 bytes failed with 3074995884 bytes already allocated.
    

    It’s easy to see that the memory usage would be more when threaded. In that case, it would be simultaneously sorting both the left and right sides of the overall array, and allocating two large temp buffers to do it. When run sequentially, the temp buffer for the left half would be freed before sorting the right half.

    Here are the wrappers I used:

    static size_t total_allocated = 0;
    static size_t max_allocated = 0;
    static pthread_mutex_t total_allocated_mutex;
    
    static void *allocate(int n)
    {
      void *result = 0;
      pthread_mutex_lock(&total_allocated_mutex);
      result = malloc(n);
      if (!result) {
        fprintf(stderr,"Allocation of %d bytes failed with %u bytes already allocated\n",n,total_allocated);
      }
      assert(result);
      total_allocated += n;
      if (total_allocated>max_allocated) {
        max_allocated = total_allocated;
      }
      pthread_mutex_unlock(&total_allocated_mutex);
      return result;
    }
    
    
    static void *deallocate(void *p,int n)
    {
      pthread_mutex_lock(&total_allocated_mutex);
      total_allocated -= n;
      free(p);
      pthread_mutex_unlock(&total_allocated_mutex);
    }
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have implemented a simple parallel merge sort algorithm in Java. This cuts the
I have implemented serial and parallel algorithm for solving linear systems using jacobi method.
My parallel program is implemented by using c++ and OpenMPI. When I test it,
I have some parallel code (implemented using MPI) that needs to be documented. I'd
I have a long-running Task that I've implemented using the Task Parallel Library. When
I want access a STL based container read-only from parallel running threads. Without using
I would like to implement a parallel version of the code below using threads
I wanted to implement a simple python program using parallel execution. It's I/O bound,
I've implemented the Barnes-Hut gravity algorithm in C as follows: Build a tree of
I'm designing a CUDA app to process some video. The algorithm I'm using calls

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.