When using kernel objects to synchronize threads running on different CPUs, is there perhaps

Asked: June 17, 2026

When using kernel objects to synchronize threads running on different CPUs, is there some extra runtime cost on Windows Server 2008 R2 relative to other operating systems?

Edit: As the answer revealed, the question should also include the phrase “when running at lower CPU utilization levels.” I included more information in my own answer to this question.

Background

I work on a product that uses shared memory and semaphores for communication between processes (when the two processes are running on the same machine). Reports of performance problems on Windows Server 2008 R2 (which I shorten to Win2008R2 after this) led me to find that sharing a semaphore between two threads on Win2008R2 was relatively slow compared to other OS’s.

Reproducing it

I was able to reproduce it by running the following bit of code concurrently on two threads:

for ( i = 0; i < N; i++ )
  {
  WaitForSingleObject( globalSem, INFINITE );
  ReleaseSemaphore( globalSem, 1, NULL );
  }

Testing with a machine that would dual boot into Windows Server 2003 R2 SP2 and Windows Server 2008 R2, the above snippet would run about 7 times faster on the Win2003R2 machine versus the Win2008R2 (3 seconds for Win2003R2 and 21 seconds for Win2008R2).

Simple Version of the Test

The following is the full version of the aforementioned test:

#include <windows.h>
#include <stdio.h>
#include <time.h>


HANDLE gSema4;
int    gIterations = 1000000;

DWORD WINAPI testthread( LPVOID tn )
{
   int count = gIterations;

   while ( count-- )
      {
      WaitForSingleObject( gSema4, INFINITE );
      ReleaseSemaphore( gSema4, 1, NULL );
      }

   return 0;
}


int main( int argc, char* argv[] )
{
   DWORD    threadId;
   clock_t  ct;
   HANDLE   threads[2];

   gSema4 = CreateSemaphore( NULL, 1, 1, NULL );

   ct = clock();
   threads[0] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
   threads[1] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );

   WaitForMultipleObjects( 2, threads, TRUE, INFINITE );

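   /* clock_t is not guaranteed to be int; cast to match the format specifier */
   printf( "Total time = %ld\n", (long)( clock() - ct ) );

   CloseHandle( threads[0] );
   CloseHandle( threads[1] );
   CloseHandle( gSema4 );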
   return 0;
}

More Details

I updated the test to force each thread to run a single iteration and then switch to the next thread on every loop. Each thread signals the next thread to run at the end of each loop (round-robin style). I also updated it to use a spinlock as an alternative to the semaphore (which is a kernel object).

All machines I tested on were 64-bit machines. I compiled the test mostly as 32-bit. If built as 64-bit, it ran a bit faster overall and changed the ratios some, but the final result was the same. In addition to Win2008R2, I also ran against Windows 7 Enterprise SP 1, Windows Server 2003 R2 Standard SP 2, Windows Server 2008 (not R2), and Windows Server 2012 Standard.

Running the test on a single CPU was significantly faster (“forced” by setting thread affinity with SetThreadAffinityMask and verified with GetCurrentProcessorNumber). Not surprisingly, a single CPU was faster on every OS, but the ratio between multi-CPU and single-CPU times with kernel object synchronization was much higher on Win2008R2. The typical ratio for all machines except Win2008R2 was 2x to 4x (running on multiple CPUs took 2 to 4 times longer). But on Win2008R2, the ratio was 9x.

However … I was not able to reproduce the slowdown on all Win2008R2 machines. I tested on 4, and it showed up on 3 of them. So I cannot help but wonder if there is some kind of configuration setting or performance tuning option that might affect this. I have read performance tuning guides, looked through various settings, and changed various settings (e.g., background service vs. foreground app) with no difference in behavior.

It does not seem to be necessarily tied to switching between physical cores. I originally suspected that it was somehow tied to the cost of repeatedly accessing global data from different cores. But when running a version of the test that uses a simple spinlock for synchronization (not a kernel object), running the individual threads on different CPUs was reasonably fast on all OS types. The ratio of the multi-CPU semaphore sync test to the multi-CPU spinlock test was typically 10x to 15x. But for the Win2008R2 Standard Edition machines, the ratio was 30x.

Here are some actual numbers from the updated test (times are in milliseconds):

+----------------+-----------+---------------+----------------+
|       OS       | 2 cpu sem |   1 cpu sem   | 2 cpu spinlock |
+----------------+-----------+---------------+----------------+
| Windows 7      | 7115 ms   | 1960 ms (3.6) | 504 ms (14.1)  |
| Server 2008 R2 | 20640 ms  | 2263 ms (9.1) | 866 ms (23.8)  |
| Server 2003    | 3570 ms   | 1766 ms (2.0) | 452 ms (7.9)   |
+----------------+-----------+---------------+----------------+

Each of the 2 threads in the test ran 1 million iterations. Those tests were all run on identical machines. The Win Server 2008 R2 and Server 2003 numbers are from a dual boot machine. The Win 7 machine has the exact same specs but was a different physical machine. The machine in this case is a Lenovo T420 laptop with a Core i5-2520M 2.5GHz. Obviously not a server-class machine, but I get similar results on true server-class hardware. The numbers in parentheses are the ratio of the first column to the given column.
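To make the parenthesized ratios concrete, here is a small sketch that recomputes them from the raw times in the table. The function names (slowdown_ratio, print_ratios) are made up for this illustration and are not part of the original test.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of the "2 cpu sem" time to another column's time, as in the table.
   Hypothetical helper, not taken from the original test code. */
static double slowdown_ratio(double two_cpu_sem_ms, double other_ms)
{
    assert(other_ms > 0.0);
    return two_cpu_sem_ms / other_ms;
}

/* Print both ratios for one table row, rounded to one decimal place. */
static void print_ratios(const char *os, double sem2, double sem1, double spin2)
{
    printf("%-15s 1 cpu sem: %.1f  2 cpu spinlock: %.1f\n",
           os, slowdown_ratio(sem2, sem1), slowdown_ratio(sem2, spin2));
}
```

For example, print_ratios("Server 2008 R2", 20640, 2263, 866) reports 9.1 and 23.8, matching the second row of the table.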

Any explanation for why this one OS would seem to introduce extra expense for kernel level synchronization across CPUs? Or do you know of some configuration/tuning parameter that might affect this?

While it would make this exceedingly verbose and long post longer, I could post the enhanced version of the test code that the above numbers came from if anyone wants it. That would show the enforcement of the round-robin logic and the spinlock version of the test.
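The enhanced test is not reproduced here. As a rough illustration only, the round-robin handoff and the spinlock described above might look something like the sketch below. It assumes C11 atomics (<stdatomic.h>) rather than whatever primitives the real test used, and every name in it (g_lock, g_turn, g_shared_counter, run_one_iteration) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* All names here are hypothetical; the original enhanced test is not shown. */
static atomic_flag g_lock = ATOMIC_FLAG_INIT; /* user-mode spinlock, no kernel object */
static atomic_int  g_turn = 0;                /* index of the thread allowed to run next */
static long        g_shared_counter = 0;      /* stand-in for the shared work */

/* Spin until the flag is acquired. */
static void spin_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&g_lock, memory_order_acquire)) {
        /* busy-wait; a real test might add a pause or yield hint here */
    }
}

static void spin_release(void)
{
    atomic_flag_clear_explicit(&g_lock, memory_order_release);
}

/* One round-robin iteration for thread `id` out of `nthreads`:
   wait for our turn, do the work under the lock, then hand off. */
static void run_one_iteration(int id, int nthreads)
{
    assert(nthreads > 0 && id >= 0 && id < nthreads);

    while (atomic_load_explicit(&g_turn, memory_order_acquire) != id) {
        /* spin until the previous thread passes the turn to us */
    }

    spin_acquire();
    g_shared_counter++;
    spin_release();

    /* Signal the next thread in the ring. */
    atomic_store_explicit(&g_turn, (id + 1) % nthreads, memory_order_release);
}
```

In a two-thread test, thread 0 and thread 1 would each call run_one_iteration(id, 2) in a loop for the configured number of iterations, so that every iteration forces a handoff to the other thread.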

Extended Background

Here is an attempt to answer some of the inevitable questions about why things are done this way. I’m the same … when I read a post, I often wonder why the poster is even asking. So here are some attempts to clarify:

What is the application? It is a database server. In some situations, customers run the client application on the same machine as the server. In that case, it is faster to use shared memory for communication (versus sockets). This question is related to the shared memory communication.
Is the workload really that dependent on events? Well … the shared memory communication is implemented using named semaphores. The client signals a semaphore, the server reads the data, and the server signals a semaphore for the client when the response is ready. On other platforms, it is blindingly fast. On Win2008R2, it is not. It is also very dependent on the customer application. If they write it with lots of small requests to the server, then there is a lot of communication between the two processes.
Can a lightweight lock be used? Possibly. I am already looking at that. But it is independent of the original question.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T17:05:26+00:00

Pulled from the comments into an answer:

Maybe the server is not set to the high-performance power plan? Win2k8 might have a different default. Many servers aren’t by default, and this hits performance very hard.

The OP confirmed this as the root cause.

This is a funny cause for this behavior. The idea popped into my head while I was doing something completely different.
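One way to check for this on a given machine is to run powercfg /getactivescheme and see whether the active plan is the built-in High performance plan (switching is done with powercfg /setactive followed by the plan GUID). As a minimal sketch, assuming the built-in plan GUIDs have not been customized, a test harness could scan a line of that output for the High performance GUID. The helper name is_high_performance is invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* GUID of the built-in "High performance" power plan (lowercase). */
static const char HIGH_PERF_GUID[] = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c";

/* Returns 1 if `line` (for example a line printed by
   `powercfg /getactivescheme`) contains the High performance GUID,
   compared case-insensitively; returns 0 otherwise.
   Hypothetical helper for illustration. */
static int is_high_performance(const char *line)
{
    assert(line != NULL);
    size_t n = sizeof HIGH_PERF_GUID - 1;

    /* Try to match the GUID starting at every position in the line. */
    for (; *line != '\0'; line++) {
        size_t i = 0;
        while (i < n && line[i] != '\0' &&
               tolower((unsigned char)line[i]) == HIGH_PERF_GUID[i]) {
            i++;
        }
        if (i == n) {
            return 1;
        }
    }
    return 0;
}
```

A benchmark could log a warning when this returns 0, since timings taken under a different plan may not be comparable across machines.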
