The following nunit test compares performance between running a single thread versus running 2 threads on a dual core machine. Specifically, this is a VMWare dual core virtual Windows 7 machine running on a quad core Linux SLED host with is a Dell Inspiron 503.
Each thread simply loops and increments 2 counters, addCounter and readCounter. This test was original testing a Queue implementation which was discovered to perform worse on a multi-core machine. So in narrowing down the problem to the small reproducible code, you have here no queue only incrementing variables and to shock and dismay, it’s far slower with 2 threads then one.
When running the first test, the Task Manager shows 1 of the cores 100% busy with the other core almost idle. Here’s the test output for the single thread test:
readCounter 360687000
readCounter2 0
total readCounter 360687000
addCounter 360687000
addCounter2 0
You see over 360 Million increments!
Next the dual thread test shows 100% busy on both cores for the whole 5 seconds duration of the test. However it’s output shows only:
readCounter 88687000
readCounter2 134606500
totoal readCounter 223293500
addCounter 88687000
addCounter2 67303250
addFailure0
That’s only 223 Million read increments. What is god’s creation are those 2 CPU’s doing for those 5 seconds to get less work done?
Any possible clue? And can you run the tests on your machine to see if you get different results? One idea is that perhaps the VMWare dual core performance isn’t what you would hope.
using System;
using System.Threading;
using NUnit.Framework;
namespace TickZoom.Utilities.TickZoom.Utilities
{
[TestFixture]
public class ActiveMultiQueueTest
{
private volatile bool stopThread = false;
private Exception threadException;
private long addCounter;
private long readCounter;
private long addCounter2;
private long readCounter2;
private long addFailureCounter;
[SetUp]
public void Setup()
{
stopThread = false;
addCounter = 0;
readCounter = 0;
addCounter2 = 0;
readCounter2 = 0;
}
[Test]
public void TestSingleCoreSpeed()
{
var speedThread = new Thread(SpeedTestLoop);
speedThread.Name = "1st Core Speed Test";
speedThread.Start();
Thread.Sleep(5000);
stopThread = true;
speedThread.Join();
if (threadException != null)
{
throw new Exception("Thread failed: ", threadException);
}
Console.Out.WriteLine("readCounter " + readCounter);
Console.Out.WriteLine("readCounter2 " + readCounter2);
Console.Out.WriteLine("total readCounter " + (readCounter + readCounter2));
Console.Out.WriteLine("addCounter " + addCounter);
Console.Out.WriteLine("addCounter2 " + addCounter2);
}
[Test]
public void TestDualCoreSpeed()
{
var speedThread1 = new Thread(SpeedTestLoop);
speedThread1.Name = "Speed Test 1";
var speedThread2 = new Thread(SpeedTestLoop2);
speedThread2.Name = "Speed Test 2";
speedThread1.Start();
speedThread2.Start();
Thread.Sleep(5000);
stopThread = true;
speedThread1.Join();
speedThread2.Join();
if (threadException != null)
{
throw new Exception("Thread failed: ", threadException);
}
Console.Out.WriteLine("readCounter " + readCounter);
Console.Out.WriteLine("readCounter2 " + readCounter2);
Console.Out.WriteLine("totoal readCounter " + (readCounter + readCounter2));
Console.Out.WriteLine("addCounter " + addCounter);
Console.Out.WriteLine("addCounter2 " + addCounter2);
Console.Out.WriteLine("addFailure" + addFailureCounter);
}
private void SpeedTestLoop()
{
try
{
while (!stopThread)
{
for (var i = 0; i < 500; i++)
{
++addCounter;
}
for (var i = 0; i < 500; i++)
{
readCounter++;
}
}
}
catch (Exception ex)
{
threadException = ex;
}
}
private void SpeedTestLoop2()
{
try
{
while (!stopThread)
{
for (var i = 0; i < 500; i++)
{
++addCounter2;
i++;
}
for (var i = 0; i < 500; i++)
{
readCounter2++;
}
}
}
catch (Exception ex)
{
threadException = ex;
}
}
}
}
Edit: I tested the above on a quad core laptop w/o vmware and got similar degraded performance. So I wrote another test similar to the above but which has each thread method in a separate class. My purpose in doing that was to test 4 cores.
Well that test showed excelled results which improved almost linearly with 1, 2, 3, or 4 cores.
With some experimentation now on both machines it appears that the proper performance only happens if main thread methods are on different instances instead of the same instance.
In other words, if multiple threads main entry method on on the same instance of a particular class, then the performance on a multi-core will be worse for each thread you add, instead of better as you might assume.
It almost appears that the CLR is “synchronizing” so only one thread at a time can run on that method. However, my testing says that isn’t the case. So it’s still unclear what’s happening.
But my own problem seems to be solved simply by making separate instances of methods to run threads as their starting point.
Sincerely,
Wayne
EDIT:
Here’s an updated unit test that tests 1, 2, 3, & 4 threads with them all on the same instance of a class. Using arrays with variables uses in the thread loop at least 10 elements apart. And performance still degrades significantly for each thread added.
using System;
using System.Threading;
using NUnit.Framework;
namespace TickZoom.Utilities.TickZoom.Utilities
{
[TestFixture]
public class MultiCoreSameClassTest
{
private ThreadTester threadTester;
public class ThreadTester
{
private Thread[] speedThread = new Thread[400];
private long[] addCounter = new long[400];
private long[] readCounter = new long[400];
private bool[] stopThread = new bool[400];
internal Exception threadException;
private int count;
public ThreadTester(int count)
{
for( var i=0; i<speedThread.Length; i+=10)
{
speedThread[i] = new Thread(SpeedTestLoop);
}
this.count = count;
}
public void Run()
{
for (var i = 0; i < count*10; i+=10)
{
speedThread[i].Start(i);
}
}
public void Stop()
{
for (var i = 0; i < stopThread.Length; i+=10 )
{
stopThread[i] = true;
}
for (var i = 0; i < count * 10; i += 10)
{
speedThread[i].Join();
}
if (threadException != null)
{
throw new Exception("Thread failed: ", threadException);
}
}
public void Output()
{
var readSum = 0L;
var addSum = 0L;
for (var i = 0; i < count; i++)
{
readSum += readCounter[i];
addSum += addCounter[i];
}
Console.Out.WriteLine("Thread readCounter " + readSum + ", addCounter " + addSum);
}
private void SpeedTestLoop(object indexarg)
{
var index = (int) indexarg;
try
{
while (!stopThread[index*10])
{
for (var i = 0; i < 500; i++)
{
++addCounter[index*10];
}
for (var i = 0; i < 500; i++)
{
++readCounter[index*10];
}
}
}
catch (Exception ex)
{
threadException = ex;
}
}
}
[SetUp]
public void Setup()
{
}
[Test]
public void SingleCoreTest()
{
TestCores(1);
}
[Test]
public void DualCoreTest()
{
TestCores(2);
}
[Test]
public void TriCoreTest()
{
TestCores(3);
}
[Test]
public void QuadCoreTest()
{
TestCores(4);
}
public void TestCores(int numCores)
{
threadTester = new ThreadTester(numCores);
threadTester.Run();
Thread.Sleep(5000);
threadTester.Stop();
threadTester.Output();
}
}
}
You’re probably running into cache contention — when a single CPU is incrementing your integer, it can do so in its own L1 cache, but as soon as two CPUs start “fighting” over the same value, the cache line it’s on has to be copied back and forth between their caches each time each one accesses it. The extra time spent copying data between caches adds up fast, especially when the operation you’re doing (incrementing an integer) is so trivial.