Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8175339
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 6, 20262026-06-06T22:44:48+00:00 2026-06-06T22:44:48+00:00

Observe the following program written in Java (complete runnable version follows, but the important

  • 0

Observe the following program written in Java (complete runnable version follows, but the important part of the program is in the snippet a little bit further below):

import java.util.ArrayList;



/** A not easy to explain benchmark.
 */
class MultiVolatileJavaExperiment {

    public static void main(String[] args) {
        (new MultiVolatileJavaExperiment()).mainMethod(args);
    }

    int size = Integer.parseInt(System.getProperty("size"));
    int par = Integer.parseInt(System.getProperty("par"));

    public void mainMethod(String[] args) {
        int times = 0;
        if (args.length == 0) times = 1;
        else times = Integer.parseInt(args[0]);
        ArrayList < Long > measurements = new ArrayList < Long > ();

        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            run();
            long end = System.currentTimeMillis();

            long time = (end - start);
            System.out.println(i + ") Running time: " + time + " ms");
            measurements.add(time);
        }

        System.out.println(">>>");
        System.out.println(">>> All running times: " + measurements);
        System.out.println(">>>");
    }

    public void run() {
        int sz = size / par;
        ArrayList < Thread > threads = new ArrayList < Thread > ();

        for (int i = 0; i < par; i++) {
            threads.add(new Reader(sz));
            threads.get(i).start();
        }
        for (int i = 0; i < par; i++) {
            try {
                threads.get(i).join();
            } catch (Exception e) {}
        }
    }

    final class Foo {
        int x = 0;
    }

    final class Reader extends Thread {
        volatile Foo vfoo = new Foo();
        Foo bar = null;
        int sz;

        public Reader(int _sz) {
            sz = _sz;
        }

        public void run() {
            int i = 0;
            while (i < sz) {
                vfoo.x = 1;
                // with the following line commented
                // the scalability is almost linear
                bar = vfoo; // <- makes benchmark 2x slower for 2 processors - why?
                i++;
            }
        }
    }

}

Explanation: The program is actually very simple. It loads integers size and par from the system properties (passed to jvm with the -D flag) – these are the input length and the number of threads to use later. It then parses the first command line argument which says how many time to repeat the program (we want to be sure that the JIT has done its work and have more reliable measurements).

The run method is called in each repetition. This method simply starts par threads, each of which will do a loop with size / par iterations. The thread body is defined in the Reader class. Each repetition of the loop reads a volatile member vfoo and assigns 1 to its public field. After that, vfoo is read once again and assigned to a non-volatile field bar.

Notice how most of the time the program is executing the loop body, so the run in the thread is the focus of this benchmark:

    final class Reader extends Thread {
        volatile Foo vfoo = new Foo();
        Foo bar = null;
        int sz;

        public Reader(int _sz) {
            sz = _sz;
        }

        public void run() {
            int i = 0;
            while (i < sz) {
                vfoo.x = 1;
                // with the following line commented
                // the scalability is almost linear
                bar = vfoo; // <- makes benchmark 2x slower for 2 processors - why?
                i++;
            }
        }
    }

Observations: Running java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 MultiVolatileJavaExperiment 10 on an

Ubuntu Server 10.04.3 LTS
8 core Intel(R) Xeon(R) CPU  X5355  @2.66GHz
~20GB ram
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

I get the following times:

>>> All running times: [821, 750, 1011, 750, 758, 755, 1219, 751, 751, 1012]

Now, setting -Dpar=2, I get:

>>> All running times: [1618, 380, 1476, 1245, 1390, 1391, 1445, 1393, 1511, 1508]

Apparently, this doesn’t scale for some reason – I would have expected the second output to be twice as fast (although it does seem to be in one of the early iterations – 380ms).

Interestingly, commenting out the line bar = vfoo (which isn’t even supposed to be a volatile write), yields the following times for -Dpar set to 1,2,4,8.

>>> All running times: [762, 563, 563, 563, 563, 563, 570, 566, 563, 563]
>>> All running times: [387, 287, 285, 284, 283, 281, 282, 282, 281, 282]
>>> All running times: [204, 146, 143, 142, 141, 141, 141, 141, 141, 141]
>>> All running times: [120, 78, 74, 74, 81, 75, 73, 73, 72, 71]

It scales perfectly.

Analysis: First of all, there are no garbage collection cycles occuring here (I’ve added -verbose:gc as well to check this).

I get similar results on my iMac.

Each thread is writing to its own field, and different Foo object instances belonging to different threads don’t appear to be ending up in the same cachelines – adding more members into Foo to increase its size doesn’t change the measurements. Each thread object instance has more than enough fields to fill up the L1 cache line. So this probably isn’t a memory issue.

My next thought was that the JIT might be doing something weird, because the early iterations usually do scale as expected in the uncommented version, so I checked this out by printing the assembly (see this post on how to do that).

java -Xmx512m -Xms512m -server -XX:CompileCommand=print,*Reader.run MultiVolatileJavaExperiment -Dsize=500000000 -Dpar=1 10

and I get these 2 outputs for the 2 versions for the Jitted method run in Reader. The commented (properly scalable) version:

[Verified Entry Point]
  0xf36c9fac: mov    %eax,-0x3000(%esp)
  0xf36c9fb3: push   %ebp
  0xf36c9fb4: sub    $0x8,%esp
  0xf36c9fba: mov    0x68(%ecx),%ebx
  0xf36c9fbd: test   %ebx,%ebx
  0xf36c9fbf: jle    0xf36c9fec
  0xf36c9fc1: xor    %ebx,%ebx
  0xf36c9fc3: nopw   0x0(%eax,%eax,1)
  0xf36c9fcc: xchg   %ax,%ax
  0xf36c9fd0: mov    0x6c(%ecx),%ebp
  0xf36c9fd3: test   %ebp,%ebp
  0xf36c9fd5: je     0xf36c9ff7
  0xf36c9fd7: movl   $0x1,0x8(%ebp)

---------------------------------------------

  0xf36c9fde: mov    0x68(%ecx),%ebp
  0xf36c9fe1: inc    %ebx               ; OopMap{ecx=Oop off=66}
                                        ;*goto
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@21 (line 83)

---------------------------------------------

  0xf36c9fe2: test   %edi,0xf7725000    ;   {poll}
  0xf36c9fe8: cmp    %ebp,%ebx
  0xf36c9fea: jl     0xf36c9fd0
  0xf36c9fec: add    $0x8,%esp
  0xf36c9fef: pop    %ebp
  0xf36c9ff0: test   %eax,0xf7725000    ;   {poll_return}
  0xf36c9ff6: ret    
  0xf36c9ff7: mov    $0xfffffff6,%ecx
  0xf36c9ffc: xchg   %ax,%ax
  0xf36c9fff: call   0xf36a56a0         ; OopMap{off=100}
                                        ;*putfield x
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79)
                                        ;   {runtime_call}
  0xf36ca004: call   0xf6f877a0         ;   {runtime_call}

The uncommented bar = vfoo (non-scalable, slower) version:

[Verified Entry Point]
  0xf3771aac: mov    %eax,-0x3000(%esp)
  0xf3771ab3: push   %ebp
  0xf3771ab4: sub    $0x8,%esp
  0xf3771aba: mov    0x68(%ecx),%ebx
  0xf3771abd: test   %ebx,%ebx
  0xf3771abf: jle    0xf3771afe
  0xf3771ac1: xor    %ebx,%ebx
  0xf3771ac3: nopw   0x0(%eax,%eax,1)
  0xf3771acc: xchg   %ax,%ax
  0xf3771ad0: mov    0x6c(%ecx),%ebp
  0xf3771ad3: test   %ebp,%ebp
  0xf3771ad5: je     0xf3771b09
  0xf3771ad7: movl   $0x1,0x8(%ebp)

-------------------------------------------------

  0xf3771ade: mov    0x6c(%ecx),%ebp
  0xf3771ae1: mov    %ebp,0x70(%ecx)
  0xf3771ae4: mov    0x68(%ecx),%edi
  0xf3771ae7: inc    %ebx
  0xf3771ae8: mov    %ecx,%eax
  0xf3771aea: shr    $0x9,%eax
  0xf3771aed: movb   $0x0,-0x3113c300(%eax)  ; OopMap{ecx=Oop off=84}
                                        ;*goto
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@29 (line 83)

-----------------------------------------------

  0xf3771af4: test   %edi,0xf77ce000    ;   {poll}
  0xf3771afa: cmp    %edi,%ebx
  0xf3771afc: jl     0xf3771ad0
  0xf3771afe: add    $0x8,%esp
  0xf3771b01: pop    %ebp
  0xf3771b02: test   %eax,0xf77ce000    ;   {poll_return}
  0xf3771b08: ret    
  0xf3771b09: mov    $0xfffffff6,%ecx
  0xf3771b0e: nop    
  0xf3771b0f: call   0xf374e6a0         ; OopMap{off=116}
                                        ;*putfield x
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79)
                                        ;   {runtime_call}
  0xf3771b14: call   0xf70307a0         ;   {runtime_call}

The differences in the two versions are within ---------. I expected to find synchronization instructions in the assembly which might account for the performance issue – while few extra shift, mov and inc instructions might affect absolute performance numbers, I don’t see how they could affect scalability.

So, I suspect that this is some sort of a memory issue related to storing to a field in the class. On the other hand, I’m also inclined to believe that the JIT does something funny, because in one iteration the measured time is twice as fast, as it should be.

Can anyone explain what is going on here?
Please be precise and include references that support your claims.

Thank you!

EDIT:

Here’s the bytecode for the fast (scalable) version:

public void run();
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 83: 18
   line 85: 24



  Code:
   Stack=2, Locals=2, Args_size=1
   0:   iconst_0
   1:   istore_1
   2:   iload_1
   3:   aload_0
   4:   getfield    #7; //Field sz:I
   7:   if_icmpge   24
   10:  aload_0
   11:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   14:  iconst_1
   15:  putfield    #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I
   18:  iinc    1, 1
   21:  goto    2
   24:  return
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 83: 18
   line 85: 24

  StackMapTable: number_of_entries = 2
   frame_type = 252 /* append */
     offset_delta = 2
     locals = [ int ]
   frame_type = 21 /* same */

The slow (non-scalable) version with bar = vfoo:

public void run();
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 82: 18
   line 83: 26
   line 85: 32



  Code:
   Stack=2, Locals=2, Args_size=1
   0:   iconst_0
   1:   istore_1
   2:   iload_1
   3:   aload_0
   4:   getfield    #7; //Field sz:I
   7:   if_icmpge   32
   10:  aload_0
   11:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   14:  iconst_1
   15:  putfield    #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I
   18:  aload_0
   19:  aload_0
   20:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   23:  putfield    #6; //Field bar:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   26:  iinc    1, 1
   29:  goto    2
   32:  return
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 82: 18
   line 83: 26
   line 85: 32

  StackMapTable: number_of_entries = 2
   frame_type = 252 /* append */
     offset_delta = 2
     locals = [ int ]
   frame_type = 29 /* same */

The more I am experimenting with this, it seems to me that this has nothing to do with volatiles at all – it has something to do with writing to object fields. My hunch is that this is somehow a memory contention issue – something with caches and false sharing, although there is no explicit synchronization at all.

EDIT 2:

Interestingly, changing the program like this:

final class Holder {
    public Foo bar = null;
}

final class Reader extends Thread {
    volatile Foo vfoo = new Foo();
    Holder holder = null;
    int sz;

    public Reader(int _sz) {
        sz = _sz;
    }

    public void run() {
        int i = 0;
        holder = new Holder();
        while (i < sz) {
            vfoo.x = 1;
            holder.bar = vfoo;
            i++;
        }
    }
}

resolves the scaling issue. Apparently, the Holder object above gets created after the thread is started, and is probably allocated in a different segment of memory, which is then being modified concurrently, as opposed to modifying the field bar in the thread object, which is somehow “close” in memory between different thread instances.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-06T22:44:50+00:00Added an answer on June 6, 2026 at 10:44 pm

    Short: apparently, the answer is false sharing due to card marking for the GC.

    A more extensive explanations is given in this question:

    Array allocation and access on the Java Virtual Machine and memory contention

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Observe the following definition of a thread subclass (the entire runnable Java source file
I have written a Hello, world program with JNI. Java calls c program with
Let's observe the following segment of code in Java that uses System.currentTimeMillis() in a
Please, observe the following simple program (based on the example from the protobuf-net project
I'm new to the package java.util.concurrent . I've created the following program testing if
Observe the following Python script, Script.py: import subprocess src_directory = 'Z:\z_7z\Some_Directory' zip_file_name = 'Test.7z'
I observe the following strange behaviour in Vim 7.3: When I'm below a fold,
Observe the following piece of an msbuild script: <ItemGroup> <R Include=-Microsoft.Design#CA1000 /> <R Include=-Microsoft.Design#CA1002
Observe the following simple source code: using System; using System.Linq.Expressions; using System.Reflection; using System.Reflection.Emit;
Why does this happen? Please observe the following code: static class StringExtension { public

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.