I implemented a parallel merge sort algorithm from Cormen’s well-known text. I wrote it in C using pthreads, and compiled with MinGW on Win7 x64 (also tested later with GCC in Ubuntu with same results). My first approach at the parallelization was naïve… I spawned a new thread at every recursion level (which is actually what Cormen’s pseudocode implies). However this usually ends up either taking way too long or crashing due to segmentation fault (I can assume there is some hard limit to how many threads the system can handle). This seems to be a common newbie mistake for recursive parallelization, in fact I found a similar DISCUSSION on this site. So I instead used the recommendation in that thread, namely setting a threshold for problem size, and if the function that spawns new threads is given a set smaller than the threshold (say 10,000 elements) then it just operates on the elements directly, rather than creating a new thread for such a small set.
Now everything seemed to be working fine. I tabulated some of my results below. N is problem size (a set of integers [1, 2, 3, …, N] thoroughly scrambled) and threshold is the value below which my parallel sort and parallel merge functions refuse to spawn new threads. The first table shows sort time in ms, the second shows how many sort/merge worker threads were spawned in each case. Looking at the N=1E6 and N=1E7 rows in the bottom table, you can see that anytime I lower the threshold such that more than ~8000 merge workers are allowed, I get segmentation fault. Again, I assume that is due to some limit the system gives on threads, and I’d be happy to hear more about that, but it’s not my main question.
The main question, is why does the final row get segfault when trying to use a fairly high threshold, which would have spawned an expected 15/33 worker threads (following pattern from previous rows). Surely this is not too many threads for my system to handle. The one instance which did complete (lower right cell in table) used about 1.2GB RAM (my system has 6GB), and the threaded versions never seem to take more RAM compared to the ones with 0 threads at the right of each row.
- I don’t think I am hitting any sort of heap limit… tons of RAM available and it should only take ~1GB even if it was allowed to spawn the 15/33 threads.
- I also don’t think it is a stack problem. I designed the program to use minimal stack, and I don’t think the footprint each thread would be related to problem size N at all, only the heap. I’m pretty inexperienced with this… but I did a core dump stack backtrace in gdb and the addresses from top to bottom of stack seem close enough to rule out overflow there.
- I tried reading the return values of pthread_create… in Windows I got a value of 11 a few times before the crash (but it didn’t seem to trigger the crash, since there were a few 11’s, then a few 0’s, i.e. no error, then another 11). That error code is EAGAIN, resources unavailable… but I am not sure what it really means here. Moreover, in Ubuntu the error code was 0 every time even right up to the crash.
- I tried Valgrind and got a lot of messages about memory leaks, but I am not sure those are legit since I know Valgrind requires extra resources, and I was able to get those types of errors on other problem set sizes that worked fine without Valgrind.
It’s pretty obvious it’s related to problem size and system resources… I’m hoping there’s some piece of general knowledge I’m missing that makes the answer really clear.
Any ideas? Sorry for the long wall of text… thanks if you’ve read this far! I can post the source if it seems relevant.

EDIT: Source added for reference:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <pthread.h>
const int N = 100000000;
const int SORT_THRESHOLD = 10000000;
const int MERGE_THRESHOLD = 10000000;
int sort_thread_count = 0;
int merge_thread_count = 0;
typedef struct s_pmergesort_args {
int *vals_in, p, r, *vals_out, s;
} pmergesort_args;
typedef struct s_pmerge_args {
int *temp, p1, r1, p2, r2, *vals_out, p3;
} pmerge_args;
void *p_merge_sort(void *v_pmsa);
void *p_merge(void *v_pma);
int binary_search(int val, int *temp, int p, int r);
int main() {
int *values, i, rand1, rand2, temp, *sorted;
long long rand1a, rand1b, rand2a, rand2b;
struct timeval start, end;
/* allocate values on heap and initialize */
values = malloc(N * sizeof(int));
sorted = malloc(N * sizeof(int));
for (i = 0; i < N; i++) {
values[i] = i + 1;
sorted[i] = 0;
}
/* scramble
* - complicated logic to maximize swapping
* - lots of testing (not shown) was done to verify optimal swapping */
srand(time(NULL));
for (i = 0; i < N/10; i++) {
rand1a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
rand1b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
rand1 = (int)((rand1a * rand1b + rand()) % N);
rand2a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
rand2b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
rand2 = (int)((rand2a * rand2b + rand()) % N);
temp = values[rand1];
values[rand1] = values[rand2];
values[rand2] = temp;
}
/* set up args for p_merge_sort */
pmergesort_args pmsa;
pmsa.vals_in = values;
pmsa.p = 0;
pmsa.r = N-1;
pmsa.vals_out = sorted;
pmsa.s = 0;
/* sort */
gettimeofday(&start, NULL);
p_merge_sort(&pmsa);
gettimeofday(&end, NULL);
/* verify sorting */
for (i = 1; i < N; i++) {
if (sorted[i] < sorted[i-1]) {
fprintf(stderr, "Error: array is not sorted.\n");
exit(0);
}
}
printf("Success: array is sorted.\n");
printf("Sorting took %dms.\n", (int)(((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec))/1000));
free(values);
free(sorted);
printf("( sort threads created: %d )\n", sort_thread_count);
printf("( merge threads created: %d )\n", merge_thread_count);
return 0;
}
void *p_merge_sort(void *v_pmsa) {
pmergesort_args pmsa = *((pmergesort_args *) v_pmsa);
int *vals_in = pmsa.vals_in;
int p = pmsa.p;
int r = pmsa.r;
int *vals_out = pmsa.vals_out;
int s = pmsa.s;
int n = r - p + 1;
pthread_t worker;
if (n > SORT_THRESHOLD) {
sort_thread_count++;
}
if (n == 1) {
vals_out[s] = vals_in[p];
} else {
int *temp = malloc(n * sizeof(int));
int q = (p + r) / 2;
int q_ = q - p + 1;
pmergesort_args pmsa_l;
pmsa_l.vals_in = vals_in;
pmsa_l.p = p;
pmsa_l.r = q;
pmsa_l.vals_out = temp;
pmsa_l.s = 0;
pmergesort_args pmsa_r;
pmsa_r.vals_in = vals_in;
pmsa_r.p = q+1;
pmsa_r.r = r;
pmsa_r.vals_out = temp;
pmsa_r.s = q_;
if (n > SORT_THRESHOLD) {
pthread_create(&worker, NULL, p_merge_sort, &pmsa_l);
} else {
p_merge_sort(&pmsa_l);
}
p_merge_sort(&pmsa_r);
if (n > SORT_THRESHOLD) {
pthread_join(worker, NULL);
}
pmerge_args pma;
pma.temp = temp;
pma.p1 = 0;
pma.r1 = q_ - 1;
pma.p2 = q_;
pma.r2 = n - 1;
pma.vals_out = vals_out;
pma.p3 = s;
p_merge(&pma);
free(temp);
}
}
void *p_merge(void *v_pma) {
pmerge_args pma = *((pmerge_args *) v_pma);
int *temp = pma.temp;
int p1 = pma.p1;
int r1 = pma.r1;
int p2 = pma.p2;
int r2 = pma.r2;
int *vals_out = pma.vals_out;
int p3 = pma.p3;
int n1 = r1 - p1 + 1;
int n2 = r2 - p2 + 1;
int q1, q2, q3, t;
pthread_t worker;
if (n1 < n2) {
t = p1; p1 = p2; p2 = t;
t = r1; r1 = r2; r2 = t;
t = n1; n1 = n2; n2 = t;
}
if (n1 > MERGE_THRESHOLD) {
merge_thread_count++;
}
if (n1 == 0) {
return;
} else {
q1 = (p1 + r1) / 2;
q2 = binary_search(temp[q1], temp, p2, r2);
q3 = p3 + (q1 - p1) + (q2 - p2);
vals_out[q3] = temp[q1];
pmerge_args pma_l;
pma_l.temp = temp;
pma_l.p1 = p1;
pma_l.r1 = q1-1;
pma_l.p2 = p2;
pma_l.r2 = q2-1;
pma_l.vals_out = vals_out;
pma_l.p3 = p3;
if (n1 > MERGE_THRESHOLD) {
pthread_create(&worker, NULL, p_merge, &pma_l);
} else {
p_merge(&pma_l);
}
pmerge_args pma_r;
pma_r.temp = temp;
pma_r.p1 = q1+1;
pma_r.r1 = r1;
pma_r.p2 = q2;
pma_r.r2 = r2;
pma_r.vals_out = vals_out;
pma_r.p3 = q3+1;
p_merge(&pma_r);
if (n1 > MERGE_THRESHOLD) {
pthread_join(worker, NULL);
}
}
}
int binary_search(int val, int *temp, int p, int r) {
int low = p;
int mid;
int high = (p > r+1)? p : r+1;
while (low < high) {
mid = (low + high) / 2;
if (val <= temp[mid]) {
high = mid;
} else {
low = mid + 1;
}
}
return high;
}
EDIT 2: Added new image below showing “max” and “total” RAM used by each version (max meaning highest simultaneous allocation/usage and total meaning the sum of all allocation requests through the program’s life). These suggest that with N=1E8 and threshold=1E7 I should get a max usage of 3.2GB (which my system should be able to support). But again… I guess it is related to some other limitation in the pthread library… not my actual system resources.

Looks like it is running out of memory. In your example, if the code is run sequentially, then the most memory it has allocated at one time is 1.6GB. When using threads, it is using more than 3GB. I put some wrappers around the malloc/free functions, and got this result:
It’s easy to see that the memory usage would be more when threaded. In that case, it would be simultaneously sorting both the left and right sides of the overall array, and allocating two large temp buffers to do it. When run sequentially, the temp buffer for the left half would be freed before sorting the right half.
Here are the wrappers I used: