Editorial Team
Asked: June 16, 2026

can you please help me to find out if it takes longer for a

Can you please help me find out whether a cache write takes longer to finish when more cores/caches hold a copy of that line?
I would also like to measure/quantify how much longer it actually takes.

I couldn’t find anything useful on Google, and I have trouble measuring it myself and interpreting what I measure, because of the many things that can happen on a modern processor
(reordering, prefetching, buffering, and god knows what).

Details:

My basic process of measuring it is roughly as follows:

write something to the cache line on processor 0
read it on processors 1 to n

rdtsc
write it on processor 0
rdtsc

I am not even sure which instructions to use for the read/write on processor 0 to make sure the write/invalidate has finished before the final time measurement.

At the moment I am fiddling with an atomic read-modify-write (__sync_fetch_and_add()), but the number of threads itself seems to affect how long this operation takes (not the number of copies to invalidate), which is probably not what I want to measure.

I also tried a read, then a write, then a memory barrier (__sync_synchronize()). This looks more like what I expect to see,
but here too I am not sure whether the write has finished by the time the final rdtsc takes place.

As you can guess my knowledge of CPU internals is somewhat limited.

Any help is very much appreciated!

ps:
* I use Linux, gcc, and pthreads for the measurements.
* I want to know this for modeling a parallel algorithm of mine.

Edit:

In a week or so (I am going on vacation tomorrow) I’ll do some more research, post my code and notes, and link them here in case someone is interested; the time I can spend on this is limited.


1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 4:17 pm

    I started writing a very long answer describing exactly how this works, then realized I probably don’t know enough about the exact details. So I’ll give a shorter answer.

    When you write something on one processor and the line is not already in that processor’s cache, the line has to be fetched first; once the processor has read the data, it performs the actual write. In doing so, it sends a cache-invalidate message to ALL other processors in the system, which then throw away their copy of the line. If another processor holds “dirty” content, it will write out the data itself and ask for an invalidation, in which case the first processor has to RELOAD the data before finishing its write (otherwise, some other element in the same cache line could get destroyed).

    Every other processor that is interested in that cache line will then have to read it back into its own cache.

    The __sync_fetch_and_add() will use a “lock” prefix [on x86; other processors may vary, but the general idea on processors that support “per instruction” locks is roughly the same]. This issues an “I want this cache line EXCLUSIVELY, everyone else please give it up and invalidate it” request. Just like in the first case, the processor may well have to re-read anything that another processor may have made dirty.

    A memory barrier will not ensure that data is updated “safely”; it will just make sure that “whatever happened (to memory) before now is visible to all processors by the time this instruction finishes”.
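As a rough illustration of the two approaches from the question, here is a minimal sketch that times a plain store followed by a full barrier, and a __sync_fetch_and_add(), using the timestamp counter. The helper names `ticks`, `timed_store` and `timed_fetch_add` are made up for this example, and the non-x86 fallback clock is an assumption added only so the sketch compiles elsewhere. It does not by itself prove that the invalidation has completed; it only brackets the operation as tightly as these builtins allow.

```c
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Timestamp counter; rdtscp waits for earlier instructions to execute. */
static inline uint64_t ticks(void)
{
    unsigned int aux;
    return __rdtscp(&aux);
}
#else
/* Assumed fallback for non-x86 targets: wall clock in nanoseconds. */
static inline uint64_t ticks(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical helper: ticks spent on one store plus a full barrier. */
static uint64_t timed_store(volatile int *line, int value)
{
    uint64_t t0 = ticks();
    *line = value;            /* the write being measured */
    __sync_synchronize();     /* full barrier before reading the clock again */
    uint64_t t1 = ticks();
    return t1 - t0;
}

/* Hypothetical helper: ticks spent on one atomic read-modify-write. */
static uint64_t timed_fetch_add(volatile int *line, int delta)
{
    uint64_t t0 = ticks();
    __sync_fetch_and_add(line, delta);   /* "lock"-prefixed on x86 */
    uint64_t t1 = ticks();
    return t1 - t0;
}
```

In an actual experiment, the reader threads on the other cores would load the line before every timed write, each thread would be pinned to its core, and the measurement would be repeated many times (taking the minimum or median) because a single sample is dominated by noise.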

    The best way to optimize the use of processors is to share as little as possible and, in particular, to avoid “false sharing”. In a benchmark many years ago, there was a structure like this [simplified]:

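    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERATIONS 100000000L   /* the "big number"; this value is an assumption */

    struct stuff {
        volatile int x[2];          /* both counters sit in the same cache line */
        /* ... other data ... total data a few cachelines. */
    } data;

    static void *thread1(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERATIONS; i++)
            data.x[0]++;
        return NULL;
    }

    static void *thread2(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERATIONS; i++)
            data.x[1]++;
        return NULL;
    }

    static double timenow(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        pthread_t t1, t2;
        double start = timenow();

        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        /* wait for both threads; otherwise only thread creation is timed */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("elapsed: %.3f s\n", timenow() - start);
        return 0;
    }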
    

    Since EVERY time thread1 wrote to x[0], thread2’s processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs. just running thread1] ran about 15 times slower. By altering the struct like this:

    struct stuff {
        int x;
        ... other data ... 
    } data[2];
    

    and

    void thread1()
    {
        for( ... big number ...)
            data[0].x++;
    }
    

    we got 200% of the performance of the single-thread variant [give or take a few percent].
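When there is no "other data" to push the two counters apart, the same effect can be had with explicit alignment. The following is a minimal sketch, assuming a 64-byte cache line (typical on current x86 parts, but an assumption here); `padded_counter`, `counters` and `counter_distance` are names invented for the example.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Aligning the member forces the whole struct to occupy its own line. */
struct padded_counter {
    alignas(CACHE_LINE) int x;
};

static struct padded_counter counters[2];

/* Distance in bytes between the two counters; should equal CACHE_LINE. */
static size_t counter_distance(void)
{
    return (size_t)((char *)&counters[1].x - (char *)&counters[0].x);
}
```

Each thread would then increment its own `counters[i].x`, so a write by one thread no longer invalidates the line the other thread is working on.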

    Right, so the processor has queues of buffers where write operations are stored while the processor is writing to memory. A memory barrier instruction (mfence, sfence or lfence) is there to ensure that any outstanding operation of the matching type (read/write, write, or read, respectively) has completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation is fulfilled one way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it will eventually end up.

    So a barrier is for when it’s CRITICAL to make sure that something has ACTUALLY been done before proceeding. For example, if we have written a bunch of instructions to video memory and now want to kick off the run of those instructions, we need to make sure that the ‘instruction’ writing has actually finished and that some other part of the processor isn’t still working on it, so we use an sfence to make sure that the write has really happened. That may not be a very realistic example, but I think you get the idea.
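The "write the commands, fence, then kick them off" pattern can be sketched in portable C11 as follows. This is only an analogy under stated assumptions: ordinary memory stands in for video memory, a C11 release fence stands in for sfence (it guarantees ordering of the earlier writes before the later store rather than their completion), and `cmd_buf`, `doorbell`, `submit` and `pending` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CMD_SLOTS 16                    /* hypothetical buffer size */

static uint32_t cmd_buf[CMD_SLOTS];     /* stands in for the command memory */
static _Atomic size_t doorbell;         /* hypothetical "go" signal: number of valid commands */

/* Write the commands first, then publish the count behind a release fence. */
static size_t submit(const uint32_t *cmds, size_t n)
{
    if (n > CMD_SLOTS)
        n = CMD_SLOTS;
    for (size_t i = 0; i < n; i++)
        cmd_buf[i] = cmds[i];
    atomic_thread_fence(memory_order_release);   /* command writes ordered before the signal */
    atomic_store_explicit(&doorbell, n, memory_order_relaxed);
    return n;
}

/* Consumer side: read the count, then fence before touching cmd_buf. */
static size_t pending(void)
{
    size_t n = atomic_load_explicit(&doorbell, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* later reads of cmd_buf see the commands */
    return n;
}
```

A consumer that sees a non-zero `pending()` can safely read that many entries of `cmd_buf`; without the fences, it could observe the count before the commands themselves.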
