I have an application level (PThreads) question regarding choice of hardware and its impact

Question

Editorial Team

Asked: June 12, 2026 (2026-06-12T11:21:46+00:00)

I have an application level (PThreads) question regarding choice of hardware and its impact on software development.

I have working multi-threaded code, well tested on a multi-core single-CPU box.

I am trying to decide what to purchase for my next machine:

A 6-core single CPU box
A 4-core dual CPU box

My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?

In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?

I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That’s when I hit some cognitive dissonance.

More detail regarding the code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24 GB, soon to be more), then I spawn my threads. That initial chunk of memory is “read-only” (enforced by my own code policies), so I don’t do any locking for that chunk.

I got confused as I was looking at 4-core dual CPU boxes: they seem to require separate memory for each CPU. In the context of my code, I have no idea what will happen “under the hood” if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU’s memory bank to another? That would affect how much memory I have to buy, raising the cost of this configuration.

The ideal situation (cost-wise and ease-of-programming-wise) is to have the two CPUs share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual-CPU motherboards (like the HP ProLiant ML350e)?

1 Answer

Editorial Team · Answer 1 · 2026-06-12T11:21:47+00:00

Modern CPUs¹ each have their own memory controller with locally attached RAM, and use a separate channel² to communicate with one another. This is a consumer-level version of the NUMA (non-uniform memory access) architecture, created for supercomputers more than a decade ago.

The idea is to avoid a shared bus (the old FSB), which can cause heavy contention because every core uses it to access memory. As you add more NUMA cells, you get higher aggregate bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: every core can still address all of the RAM, but RAM attached to its own socket is faster to reach than RAM attached to another socket.

Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it’s okay to move from one core to another in the same socket; sometimes there’s a whole hierarchy specifying which resources (L1, L2 and L3 cache, RAM channel, I/O, etc.) are shared and which aren’t, and that determines whether moving the task carries a penalty. Sometimes the scheduler can decide that waiting for the right core would be pointless and that it’s better to shove the whole thing onto another socket.

In the vast majority of cases, it’s best to let the scheduler do what it knows best. If that isn’t good enough, you can play around with numactl.

As for the specific case of a given program, the best architecture depends heavily on the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator would prioritize local RAM, making it less important which cell each thread happens to be on.

If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance would suffer if those threads are not on the same cell. You could try to create small thread groups and confine heavy sharing to within each group; then each group could go on a different cell without problem.

The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won’t be any way to optimize it to use more cores than are available on a single cell. It might even be best to limit the whole process to the cores of a single cell, effectively wasting the rest.

¹ By modern, I mean any 64-bit AMD chip, and Nehalem or later for Intel.

² AMD calls this channel HyperTransport; Intel’s name for it is QuickPath Interconnect (QPI).

EDIT:

You mention that you initialize “a big chunk of read-only memory” and then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, it would be a lot better to initialize each part on the thread that will use it, after spawning that thread. That lets the threads spread across several cores first, and the allocator would then choose local RAM for each one, a much more effective layout. Maybe there’s some way to hint the scheduler to migrate the threads away as soon as they’re spawned, but I don’t know the details.
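A minimal sketch of that per-thread initialization, assuming Linux's default local-allocation ("first touch") policy, under which a page is normally placed on the node of the CPU that first writes to it. The names slice_t, worker and run_workers are made up for illustration, and the memset stands in for whatever really loads each slice (for example a pread() of that thread's part of the file).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* One worker's share of the big buffer (hypothetical layout). */
typedef struct {
    unsigned char *base;  /* start of this worker's slice */
    size_t         len;   /* slice length in bytes */
    unsigned char  fill;  /* stand-in for "data loaded from disk" */
    unsigned long  sum;   /* result of this worker's processing */
} slice_t;

static void *worker(void *arg)
{
    slice_t *s = arg;

    /* First write to the slice happens here, on the worker thread.
     * Under a first-touch policy the kernel can back these pages
     * with RAM local to whichever CPU this thread is running on. */
    memset(s->base, s->fill, s->len);

    /* From here on the slice is treated as read-only. */
    unsigned long sum = 0;
    for (size_t i = 0; i < s->len; i++)
        sum += s->base[i];
    s->sum = sum;
    return NULL;
}

/* Allocate one buffer, let each thread initialize and process its own
 * slice, and return the combined result (0 if allocation fails). */
unsigned long run_workers(size_t nthreads, size_t slice_len)
{
    assert(nthreads > 0 && slice_len > 0);

    /* The main thread allocates but deliberately never writes to buf. */
    unsigned char *buf = malloc(nthreads * slice_len);
    pthread_t *tids = malloc(nthreads * sizeof *tids);
    slice_t *slices = malloc(nthreads * sizeof *slices);
    unsigned long total = 0;

    if (buf && tids && slices) {
        size_t started = 0;
        for (; started < nthreads; started++) {
            slices[started] = (slice_t){ buf + started * slice_len, slice_len,
                                         (unsigned char)(started + 1), 0 };
            if (pthread_create(&tids[started], NULL, worker,
                               &slices[started]) != 0)
                break;
        }
        for (size_t i = 0; i < started; i++) {
            pthread_join(tids[i], NULL);
            total += slices[i].sum;
        }
    }
    free(slices);
    free(tids);
    free(buf);
    return total;
}
```

Calling run_workers(4, 1000) starts four threads that each fill and sum their own 1000-byte slice. Whether pages really end up local depends on the OS policy and on where the scheduler has placed each thread at the moment of first touch, so treat this as something to measure rather than a guarantee.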

EDIT 2:

If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing into it. There are some general advantages:

No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using the mapping right away. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application memory, mmapped RAM, buffers and cache.
It’s less code!
Data that isn’t needed won’t be read and won’t use up RAM.
You can explicitly mark the mapping as read-only. Any bug that tries to write to it will crash the process with a segmentation fault (and typically a core dump).
Since the OS knows the mapping is read-only, its pages can’t be ‘dirty’, so if the RAM is needed elsewhere, the OS will simply discard them and reread them when needed.

But in this case, you also get:

Since data is read lazily, each RAM page would be chosen after the threads have spread across all available cores; this would allow the OS to choose pages close to the thread that uses them.

So, I think that if two conditions hold:

the data isn’t processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them

then, just by using mmap, you should be able to take advantage of machines of any size.
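The mmap approach described above could look roughly like the sketch below. The helper names map_file_readonly and unmap_file are hypothetical. The sketch assumes a regular, non-empty file and maps it with PROT_READ, so any stray write through the returned pointer faults instead of silently corrupting the data.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. On success returns a pointer to the data
 * and stores its size in *len_out; on any failure returns NULL. */
const unsigned char *map_file_readonly(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* No RAM is filled here: pages are read from disk lazily, the first
     * time some thread touches them. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);

    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}

/* Release a mapping obtained from map_file_readonly(). */
int unmap_file(const unsigned char *p, size_t len)
{
    return munmap((void *)p, len);
}
```

The main thread would call map_file_readonly() once and hand each worker a pointer and length for its own part of the mapping. The workers then fault their pages in themselves, which is what gives the OS the chance to place those pages near the thread that reads them.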

If each part of the data is read by more than one thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep those threads in the same NUMA cell.
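One way to give such a hint on Linux with glibc is thread affinity. The sketch below is an assumption about how you might do it, not the only option: pthread_setaffinity_np and pthread_getaffinity_np are non-portable GNU extensions, the helper names are made up, and the code does not discover which CPU ids belong to which NUMA cell. You would have to supply those yourself, for example from the output of numactl --hardware.

```c
#define _GNU_SOURCE   /* must come before any #include to expose the *_np calls */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Restrict one thread to the given CPU ids (for example, all cores of
 * one NUMA cell). Returns 0 on success, otherwise the error number
 * reported by pthread_setaffinity_np(). */
int pin_thread_to_cpus(pthread_t thread, const int *cpus, size_t ncpus)
{
    assert(cpus != NULL || ncpus == 0);

    cpu_set_t set;
    CPU_ZERO(&set);
    for (size_t i = 0; i < ncpus; i++)
        CPU_SET(cpus[i], &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Return the lowest CPU id the calling thread is allowed to run on,
 * or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Each thread in a sharing group would call pin_thread_to_cpus(pthread_self(), cell_cpus, n) with the same CPU list, where cell_cpus is whatever list you determined for one cell. Pinning takes decisions away from the scheduler, so it is worth benchmarking against the unpinned version before keeping it.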
