Editorial Team
Asked: June 17, 2026


When using kernel objects to synchronize threads running on different CPUs, is there perhaps some extra runtime cost when using Windows Server 2008 R2 relative to other OS’s?

Edit: And as found out via the answer, the question should also include the phrase, “when running at lower CPU utilization levels.” I included more information in my own answer to this question.

Background

I work on a product that uses shared memory and semaphores for communication between processes (when the two processes are running on the same machine). Reports of performance problems on Windows Server 2008 R2 (which I shorten to Win2008R2 after this) led me to find that sharing a semaphore between two threads on Win2008R2 was relatively slow compared to other OS’s.

Reproducing it

I was able to reproduce it by running the following bit of code concurrently on two threads:

for ( i = 0; i < N; i++ )
  {
  WaitForSingleObject( globalSem, INFINITE );
  ReleaseSemaphore( globalSem, 1, NULL );
  }

Testing on a machine that dual boots into Windows Server 2003 R2 SP2 and Windows Server 2008 R2, the above snippet ran about 7 times faster under Win2003R2 than under Win2008R2 (3 seconds versus 21 seconds).

Simple Version of the Test

The following is the full version of the aforementioned test:

#include <windows.h>
#include <stdio.h>
#include <time.h>


HANDLE gSema4;
int    gIterations = 1000000;

DWORD WINAPI testthread( LPVOID tn )
{
   int count = gIterations;

   while ( count-- )
      {
      WaitForSingleObject( gSema4, INFINITE );
      ReleaseSemaphore( gSema4, 1, NULL );
      }

   return 0;
}


int main( int argc, char* argv[] )
{
   DWORD    threadId;
   clock_t  ct;
   HANDLE   threads[2];

   gSema4 = CreateSemaphore( NULL, 1, 1, NULL );

   ct = clock();
   threads[0] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
   threads[1] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );

   WaitForMultipleObjects( 2, threads, TRUE, INFINITE );

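   /* clock_t is a long in the Microsoft C runtime, so cast and print with %ld */
   printf( "Total time = %ld\n", (long)( clock() - ct ) );

   CloseHandle( threads[0] );
   CloseHandle( threads[1] );
   CloseHandle( gSema4 );
   return 0;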
}

More Details

I updated the test so that each thread runs a single iteration and then forces a switch to the next thread: at the end of each loop, a thread signals the next one to run (round-robin style). I also added a spinlock variant as an alternative to the semaphore (which is a kernel object).
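The updated test itself is not shown in the post. A minimal sketch of what the spinlock round-robin variant might look like, assuming C11 atomics and POSIX threads instead of the Win32 calls used above (all names here, such as `rr_run` and `rr_turn`, are hypothetical):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define RR_THREADS 2             /* assumption: two threads, as in the test */

static atomic_int rr_turn;       /* index of the thread allowed to run next */
static int        rr_iterations; /* iterations per thread */
static long       rr_handoffs;   /* only touched by the thread holding the turn */

static void *rr_worker( void *arg )
{
   int me = (int)(size_t)arg;
   int i;

   for ( i = 0; i < rr_iterations; i++ )
      {
      /* Spin in user mode until it is this thread's turn (no kernel object). */
      while ( atomic_load_explicit( &rr_turn, memory_order_acquire ) != me )
         ;
      rr_handoffs++;   /* stand-in for the work of one iteration */
      /* Hand the turn to the next thread, round-robin style. */
      atomic_store_explicit( &rr_turn, ( me + 1 ) % RR_THREADS,
                             memory_order_release );
      }
   return NULL;
}

/* Runs the round-robin test and returns the total number of handoffs. */
long rr_run( int iterations )
{
   pthread_t t[RR_THREADS];
   int       i;

   rr_iterations = iterations;
   rr_handoffs   = 0;
   atomic_store( &rr_turn, 0 );

   for ( i = 0; i < RR_THREADS; i++ )
      if ( pthread_create( &t[i], NULL, rr_worker, (void *)(size_t)i ) != 0 )
         abort();   /* sketch only: no recovery if a thread cannot start */

   for ( i = 0; i < RR_THREADS; i++ )
      pthread_join( t[i], NULL );

   return rr_handoffs;
}
```

Timing a call such as `rr_run( 1000000 )` would correspond to the spinlock measurement. A semaphore variant would replace the spin and the store with a wait on the thread's own semaphore and a release of the next thread's semaphore.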

All machines I tested on were 64-bit machines. I compiled the test mostly as 32-bit. If built as 64-bit, it ran a bit faster overall and changed the ratios some, but the final result was the same. In addition to Win2008R2, I also ran against Windows 7 Enterprise SP 1, Windows Server 2003 R2 Standard SP 2, Windows Server 2008 (not R2), and Windows Server 2012 Standard.

  • Running the test on a single CPU was significantly faster (“forced” by setting thread affinity with SetThreadAffinityMask and checked with GetCurrentProcessorNumber). Not surprisingly, it was faster on all OS’s when using a single CPU, but the ratio between multi-cpu and single cpu with the kernel object synchronization was much higher on Win2008R2. The typical ratio for all machines except Win2008R2 was 2x to 4x (running on multiple CPUs took 2 to 4 times longer). But on Win2008R2, the ratio was 9x.
  • However … I was not able to reproduce the slowdown on all Win2008R2 machines. I tested on 4, and it showed up on 3 of them. So I cannot help but wonder if there is some kind of configuration setting or performance tuning option that might affect this. I have read performance tuning guides, looked through various settings, and changed various settings (e.g., background service vs foreground app) with no difference in behavior.
  • It does not seem to be necessarily tied to switching between physical cores. I originally suspected that it was somehow tied to the cost of accessing global data on different cores repeatedly. But when running a version of the test that uses a simple spinlock for synchronization (not a kernel object), running the individual threads on different CPUs was reasonably fast on all OS types. The ratio of the multi-cpu semaphore sync test vs multi-cpu spinlock test was typically 10x to 15x. But for the Win2008R2 Standard Edition machines, the ratio was 30x.

Here are some actual numbers from the updated test (times are in milliseconds):

+----------------+-----------+---------------+----------------+
|       OS       | 2 cpu sem |   1 cpu sem   | 2 cpu spinlock |
+----------------+-----------+---------------+----------------+
| Windows 7      | 7115 ms   | 1960 ms (3.6) | 504 ms (14.1)  |
| Server 2008 R2 | 20640 ms  | 2263 ms (9.1) | 866 ms (23.8)  |
| Server 2003    | 3570 ms   | 1766 ms (2.0) | 452 ms (7.9)   |
+----------------+-----------+---------------+----------------+

Each of the 2 threads in the test ran 1 million iterations. Those tests were all run on identical machines. The Server 2008 R2 and Server 2003 numbers are from a dual boot machine. The Win 7 machine has the exact same specs but was a different physical machine. The machine in this case is a Lenovo T420 laptop with a Core i5-2520M 2.5GHz. Obviously not a server class machine, but I get similar results on true server class hardware. The numbers in parentheses are the ratio of the first column to the given column.
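The ratios in parentheses can be reproduced from the raw times. A small sketch (the function and struct names are made up for illustration; the numbers are copied from the table):

```c
#include <stdio.h>

/* Ratio of the "2 cpu sem" time to the time in another column. */
double slowdown_ratio( double two_cpu_sem_ms, double other_ms )
{
   return two_cpu_sem_ms / other_ms;
}

/* Prints the two ratio columns for each OS in the table. */
void print_ratio_table( void )
{
   struct row { const char *os; double sem2, sem1, spin2; };
   static const struct row rows[] = {
      { "Windows 7",      7115.0,  1960.0, 504.0 },
      { "Server 2008 R2", 20640.0, 2263.0, 866.0 },
      { "Server 2003",    3570.0,  1766.0, 452.0 },
   };
   size_t i;

   for ( i = 0; i < sizeof rows / sizeof rows[0]; i++ )
      printf( "%-14s  vs 1 cpu sem: %.1f  vs 2 cpu spinlock: %.1f\n",
              rows[i].os,
              slowdown_ratio( rows[i].sem2, rows[i].sem1 ),
              slowdown_ratio( rows[i].sem2, rows[i].spin2 ) );
}
```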

Any explanation for why this one OS would seem to introduce extra expense for kernel level synchronization across CPUs? Or do you know of some configuration/tuning parameter that might affect this?

While it would make this already long and verbose post even longer, I could post the enhanced version of the test code that the above numbers came from if anyone wants it. That would show the enforcement of the round-robin logic and the spinlock version of the test.

Extended Background

Here is an attempt to answer some of the inevitable questions about why things are done this way. I'm the same … when I read a post, I often wonder why the poster is even asking. So here are some attempts to clarify:

  • What is the application? It is a database server. In some situations, customers run the client application on the same machine as the server. In that case, it is faster to use shared memory for communications (versus sockets). This question is related to the shared memory comm.
  • Is the workload really that dependent on events? Well … the shared memory comm is implemented using named semaphores. The client signals a semaphore, the server reads the data, and the server signals a semaphore for the client when the response is ready. On other platforms, it is blindingly fast. On Win2008R2, it is not. It is also very dependent on the customer application. If they write it with lots of small requests to the server, then there is a lot of communication between the two processes.
  • Can a lightweight lock be used? Possibly. I am already looking at that. But it is independent of the original question.
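For the lightweight-lock idea in the last bullet, one common shape is to spin briefly in user mode and fall back to the kernel semaphore only when the spin fails. A minimal sketch, assuming POSIX semaphores and C11 atomics rather than the Win32 calls used in the question. The `light_sem` type, its functions, and the `spin_limit` parameter are hypothetical names, and this in-process version would need its state placed in shared memory (with a process-shared or named semaphore) to work between two processes:

```c
#include <semaphore.h>
#include <stdatomic.h>

typedef struct
{
   atomic_int count;    /* > 0: signals available; < 0: number of sleeping waiters */
   sem_t      kernel;   /* only used when a waiter actually has to sleep */
} light_sem;

/* Returns 0 on success, like sem_init. */
int light_sem_init( light_sem *s, int initial )
{
   atomic_init( &s->count, initial );
   return sem_init( &s->kernel, 0, 0 );
}

void light_sem_wait( light_sem *s, int spin_limit )
{
   int i;

   /* Fast path: try to take a signal without entering the kernel. */
   for ( i = 0; i < spin_limit; i++ )
      {
      int c = atomic_load_explicit( &s->count, memory_order_relaxed );
      if ( c > 0 &&
           atomic_compare_exchange_weak_explicit( &s->count, &c, c - 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed ) )
         return;
      }

   /* Slow path: register as a waiter and sleep if no signal was pending. */
   if ( atomic_fetch_sub_explicit( &s->count, 1, memory_order_acquire ) <= 0 )
      while ( sem_wait( &s->kernel ) != 0 )
         ;   /* retry if interrupted by a signal handler */
}

void light_sem_post( light_sem *s )
{
   /* Wake a sleeper only if the old count shows one is waiting. */
   if ( atomic_fetch_add_explicit( &s->count, 1, memory_order_release ) < 0 )
      sem_post( &s->kernel );
}
```

Whether this helps depends on how long the other side typically takes to respond: if the reply usually arrives within the spin window, the kernel object is never touched, otherwise the cost is the spin plus the normal wait.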

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 5:05 pm

    Pulled from the comments into an answer:

    Maybe the server is not set to the high-performance power plan? Win2k8 might have a different default. Many servers aren’t by default, and this hits performance very hard.

    The OP confirmed this as the root cause.

    This is a funny cause for this behavior. The idea flashed up in my head while I was doing something completely different.

