The Archive Base

Editorial Team
Asked: June 12, 2026

I have an application level (PThreads) question regarding choice of hardware and its impact

I have an application level (PThreads) question regarding choice of hardware and its impact on software development.

I have working multi-threaded code tested well on a multi-core single CPU box.

I am trying to decide what to purchase for my next machine:

  • A 6-core single CPU box
  • A 4-core dual CPU box

My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?

In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?

I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That’s when I hit some cognitive dissonance.

More detail regarding the code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24 GB, soon to be more), then I spawn my threads. That initial chunk of memory is “read-only” (enforced by my own code policies), so I don’t do any locking for that chunk.

I got confused as I was looking at 4-core dual CPU boxes: they seem to require separate memory for each CPU. In the context of my code, I have no idea what will happen “under the hood” if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU’s memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration).

The ideal situation (cost-wise and ease-of-programming-wise) is to have both CPUs share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual-CPU motherboards (like the HP ProLiant ML350e)?

1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 11:21 am

    Modern CPUs[1] each have their own memory controller and locally attached RAM, and use a separate channel[2] to communicate with each other. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.

    The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it’s used by every core to access memory. As you add more NUMA cells, you get higher aggregate memory bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: every core can still address all of the RAM, but the RAM attached to another socket is slower to reach than the local RAM.

    Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it’s okay to move from one core to another in the same socket; sometimes there’s a whole hierarchy specifying which resources (L1, L2, and L3 cache, RAM channel, I/O, etc.) are shared and which aren’t, and that determines whether moving the task would incur a penalty. Sometimes the scheduler can determine that waiting for the right core would be pointless and that it’s better to shovel the whole thing to another socket.

    In the vast majority of cases, it’s best to let the scheduler do what it knows best. If not, you can play around with numactl.
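    numactl works from the command line, on a whole process. If you ever need the same kind of control from inside a PThreads program, one option is the thread affinity call that glibc provides on Linux. The following is a minimal sketch, assuming Linux and glibc (pthread_setaffinity_np is a GNU extension, not POSIX); the helper names pin_self_to_cpu and first_allowed_cpu are made up for this illustration.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict the calling thread to a single CPU.
 * Returns 0 on success, or the error code from pthread_setaffinity_np. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Return the lowest-numbered CPU the calling thread is currently allowed
 * to run on, or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}
```

    A worker would call pin_self_to_cpu() at the top of its thread function. Pinning removes the scheduler’s freedom to rebalance, so it is only worth trying after measuring that the default placement is actually the bottleneck.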

    As for the specific case of a given program, the best architecture depends heavily on the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator would prioritize local RAM, making it less important which cell each thread happens to be on.

    If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance would suffer if those threads are not on the same cell. You could try to create small thread groups and confine heavy sharing to within each group; then each group could go on a different cell without problem.
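    One way to sketch the thread-group idea on Linux: look up which CPUs belong to a NUMA cell, then create every thread of a group with that CPU set as its affinity mask. This assumes Linux with glibc and relies on the sysfs file /sys/devices/system/node/nodeN/cpulist; the function names parse_cpulist, node_cpus and spawn_on_cpus are hypothetical helpers, not part of any library.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a Linux cpulist string such as "0-3,8-11" into a CPU set.
 * Returns 0 on success, -1 on malformed input. */
int parse_cpulist(const char *text, cpu_set_t *out)
{
    const char *p = text;
    CPU_ZERO(out);
    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        if (end == p)
            return -1;
        long last = first;
        if (*end == '-') {              /* a range "first-last" */
            p = end + 1;
            last = strtol(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (first < 0 || last < first || last >= CPU_SETSIZE)
            return -1;
        for (long cpu = first; cpu <= last; cpu++)
            CPU_SET((int)cpu, out);
        if (*end == ',')
            p = end + 1;
        else if (*end == '\0' || *end == '\n')
            p = end;
        else
            return -1;
    }
    return 0;
}

/* Read the CPUs of one NUMA node from sysfs (Linux-specific path). */
int node_cpus(int node, cpu_set_t *out)
{
    char path[64], line[4096];
    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/cpulist", node);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *got = fgets(line, sizeof(line), f);
    fclose(f);
    return got != NULL ? parse_cpulist(line, out) : -1;
}

/* Start a thread that may only run on the CPUs in `cpus`.
 * Returns 0 on success, or a pthread error code. */
int spawn_on_cpus(pthread_t *tid, const cpu_set_t *cpus,
                  void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setaffinity_np(&attr, sizeof(*cpus), cpus);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}
```

    The threads of group 0 would be started with the set returned by node_cpus(0, ...), those of group 1 with node_cpus(1, ...), and so on. Within its cell, each thread is still free to move between cores, so the scheduler keeps some room to balance load.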

    The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won’t be any way to optimize it to use more cores than what are available on a cell. It might even be best to limit the whole process to just use the cores in a single cell, effectively wasting the rest.

    [1] By modern, I mean any 64-bit AMD chip, and Nehalem or later for Intel.

    [2] AMD calls this channel HyperTransport; Intel’s name for it is QuickPath Interconnect (QPI).

    EDIT:

    You mention that you initialize “a big chunk of read-only memory” and then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, it would be a lot better to initialize each part from within its thread, after spawning it. That would allow the threads to spread across several cores before the memory is touched, and the allocator would choose local RAM for each, a much more effective layout. Maybe there’s some way to hint the scheduler to migrate the threads away as soon as they’re spawned, but I don’t know the details.
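    A rough sketch of what “initialize it on the thread” could look like: each worker allocates its own buffer and reads its own slice of the file, so the pages are first written by the thread that will later use them. This assumes the default Linux behaviour of placing a page on the cell of the thread that first touches it; the slice_t struct and the function names are invented for the example, and the split of the file into slices is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* One worker's share of the input file (illustrative layout). */
typedef struct {
    const char *path;     /* file to read from */
    off_t offset;         /* where this worker's slice starts */
    size_t length;        /* how many bytes belong to this worker */
    unsigned char *data;  /* allocated and filled by the worker itself */
    int ok;               /* set to 1 once the slice is fully loaded */
} slice_t;

/* Allocate the buffer and read the slice from inside the worker thread.
 * Returns 0 on success, -1 on any failure (caller frees s->data). */
static int load_slice(slice_t *s)
{
    int fd = open(s->path, O_RDONLY);
    if (fd < 0)
        return -1;
    s->data = malloc(s->length);
    size_t done = 0;
    while (s->data != NULL && done < s->length) {
        ssize_t n = pread(fd, s->data + done, s->length - done,
                          s->offset + (off_t)done);
        if (n <= 0)       /* error or unexpected end of file */
            break;
        done += (size_t)n;
    }
    close(fd);
    return (s->data != NULL && done == s->length) ? 0 : -1;
}

/* Thread entry point: load the slice, then work on it on the same thread. */
void *worker(void *arg)
{
    slice_t *s = arg;
    s->ok = (load_slice(s) == 0);
    /* ... process s->data here ... */
    return NULL;
}
```

    The main thread only fills in path, offset and length for each slice_t and calls pthread_create(&tid, NULL, worker, &slice); it never touches the data itself.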

    EDIT 2:

    If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:

    1. No need to preallocate RAM.
    2. The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
    3. The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
    4. It’s less code!
    5. Data that isn’t needed won’t be read and won’t use up RAM.
    6. You can explicitly map it read-only. Any bug that tries to write will then trigger a segmentation fault (and usually a core dump).
    7. Since the OS knows it’s read-only, it can’t be ‘dirty’, so if the RAM is needed, the OS will simply discard it and reread it when needed.

    but in this case, you also get:

    • Since data is read lazily, each RAM page would be chosen after the threads have spread across all available cores; this would allow the OS to choose pages close to the thread that uses them.

    So, I think that if two conditions hold:

    • the data isn’t processed in any way between disk and RAM
    • each part of the data is read (mostly) by one single thread, not touched by all of them.

    then, just by using mmap, you should be able to take advantage of machines of any size.
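    For reference, a read-only mapping of a whole file takes only a few lines. This is a minimal sketch using the standard POSIX calls open, fstat and mmap; the wrapper names map_readonly and unmap_readonly are made up here, and error reporting is reduced to returning NULL.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. On success returns the mapping and stores its
 * size in *length; on failure (or for an empty file) returns NULL. */
const unsigned char *map_readonly(const char *path, size_t *length)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ only: any write through this pointer faults immediately. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *length = (size_t)st.st_size;
    return p;
}

/* Release a mapping obtained from map_readonly. */
void unmap_readonly(const unsigned char *data, size_t length)
{
    munmap((void *)data, length);
}
```

    The main thread would call map_readonly() once and hand the pointer to every worker; no page is actually read from disk until some thread touches it.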

    If each part of the data is read by more than one thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep those threads in the same NUMA cell.

© 2021 The Archive Base. All Rights Reserved