I have 2 servers which have different specifications but they both run the same application.
Server 1 is a Hyper-V VM with 2 x 2.4 GHz and server 2 is a VPS with 2 x Intel Xeon CPU E5540 2.53 GHz.
I have a generic handler which takes input from a form and processes a list of objects in parallel using Parallel.For. I use the default MaxDegreeOfParallelism. Nothing strange.
But when I enabled some logging to figure out why the second server is faster at doing the same work as the first server, the results did not match what I would expect.
The “problem” is that I have logs from server 1 which look like this (excerpt):
ÖVERKALIX -> table.Select [1]: 78.125 ms doubles.AddRange: 0 ms results [0]: 0 ms
ÖVERKALIX -> table.Select [1]: 62.5 ms doubles.AddRange: 0 ms results [0]: 0 ms
ÖVERTORNEÅ -> table.Select [1]: 62.5 ms doubles.AddRange: 0 ms results [0]: 0 ms
ÖVERTORNEÅ -> table.Select [1]: 78.125 ms doubles.AddRange: 0 ms results [0]: 0 ms
Total server time to execute 592 queries: 20062.5 ms
And a log from the second one like this (excerpt):
ÖVERKALIX -> table.Select [1]: 99 ms doubles.AddRange: 0 ms results [0]: 0 ms
ÖVERKALIX -> table.Select [1]: 103 ms doubles.AddRange: 0 ms results [0]: 0 ms
ÖVERTORNEÅ -> table.Select [1]: 100 ms doubles.AddRange: 0 ms results [0]: 0 ms
ÖVERTORNEÅ -> table.Select [1]: 104 ms doubles.AddRange: 0 ms results [0]: 0 ms
Total server time to execute 592 queries: 4479 ms
If you look at it, you see that something is strange here. The first server executes each individual query faster than the second server, but its total time for all queries is higher than the second server's.
WHY?
What you would normally expect is that if there are n operations to be done and each operation takes t ms, the total time should NEVER be more than the total for n operations that each take (for example) (t + 1) ms.
But what we have here is logs effectively saying that t > (t + 1). I'm disappointed! Well, I'm no expert, but that's impossible 🙂
So, what are your thoughts on this?
Is it due to some hyperthreading stuff?
Is it because it takes more time to spawn a new thread on the first server (this seems like the most reasonable answer)?
If it is due to thread-creation problems, is there any way I can measure this?
UPDATE:
I have looked deeper into the problem and a pattern emerges. Here is some data from server 1 (time in ms):
78.125
187.5
78.125
93.75
750
62.5
62.5
62.5
78.125
46.875
78.125
46.875
1203.125
62.5
1125
78.125
2500
62.5
46.875
78.125
62.5
62.5
1484.375
62.5
62.5
1437.5
62.5
78.125
And here are the same queries executed on server 2:
104
104
156
116
117
116
114
115
112
107
110
112
164
128
128
124
112
111
99
104
109
105
241
115
116
115
112
112
As you can see, server 1 is usually faster, but occasionally (like the values 1203.125, 1484.375 and 2500) it takes a lot more time than server 2.
So it seems that server 1 is faster on a small set of queries and server 2 is faster (smoother?) on a large set of queries?
Can any conclusions be made from these numbers?
Why do we see these differences?
Thanks in advance!
There are SO many things here that could be going on.
First off, I would have expected server 2 to be faster. It has faster processors, after all.
Regardless:
You mention that both servers are running your app; however, both are also virtual machines.
What else is running on those physical boxes? Or even, what else is running within the virtual machines?
It could literally be anything. Maybe server 1 also has a VM that runs a scheduled job every so often that hogs your resources… Maybe server 1 has a completely different disk array whose write cache can’t keep up with the demand and has to pause every so often to flush?
Maybe server 1’s NIC gets overloaded with inbound/outbound data, again caused by some type of scheduled job. Maybe Bob, the ever helpful sysadmin, likes to log into server 1 to have it download his totally legal MSDN software.
Point is, no one is going to be able to tell you what’s happening as there are WAY too many variables involved.
Where I’d start:
1. Ensure NOTHING else is running on Server 1 except for your VM and your app. I mean absolutely nothing: no scheduled jobs, no other applications, nothing. Do the same for Server 2.
2. Profile. What’s going on with the CPU, disk and memory? Is server 1 having to page memory to/from disk? In other words, does it have enough RAM to keep your app and all its data in memory without having to flush it? What about Server 2?
3. If you are doing disk reads, what are the drive characteristics of the two machines? You could have radically different performance on nearly identical machines where literally the only difference is that one has 15k RPM SCSI drives in a RAID 0 config and the other has a single 5400 RPM PATA drive.
4. Did I mention profiling? Where is the pause occurring, and what is the state of the physical hardware at the time of the pause? Are you processing identical data on each box?
5. Decide whether it matters. This should probably be number 1. After all, you have different hardware; ergo, you should expect different performance results. Probably the only thing that really matters is that Server 1 sometimes experiences a pause. In that case, ignore server 2 completely and profile server 1.
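To put a rough number on that last point, here is a quick sketch using the 28 timings from the question's update. It's in Python purely for the arithmetic; the language doesn't matter. The 500 ms cutoff for what counts as a "pause" is my own arbitrary assumption, and the names (`summarize`, `PAUSE_MS`) are made up for illustration.

```python
from statistics import median

# Per-query timings in ms, copied from the question's update.
server1 = [
    78.125, 187.5, 78.125, 93.75, 750, 62.5, 62.5,
    62.5, 78.125, 46.875, 78.125, 46.875, 1203.125, 62.5,
    1125, 78.125, 2500, 62.5, 46.875, 78.125, 62.5,
    62.5, 1484.375, 62.5, 62.5, 1437.5, 62.5, 78.125,
]
server2 = [
    104, 104, 156, 116, 117, 116, 114,
    115, 112, 107, 110, 112, 164, 128,
    128, 124, 112, 111, 99, 104, 109,
    105, 241, 115, 116, 115, 112, 112,
]

PAUSE_MS = 500  # assumed cutoff for a "pause"; not taken from the question


def summarize(samples, cutoff=PAUSE_MS):
    """Return the typical cost, the total cost, and how much of it is stalls."""
    slow = [s for s in samples if s > cutoff]
    return {
        "count": len(samples),
        "median": median(samples),
        "total": sum(samples),
        "slow_count": len(slow),
        "slow_total": sum(slow),
    }


s1 = summarize(server1)
s2 = summarize(server2)

for name, s in (("server 1", s1), ("server 2", s2)):
    share = s["slow_total"] / s["total"]
    print(
        f"{name}: median {s['median']} ms, total {s['total']} ms, "
        f"{s['slow_count']} of {s['count']} samples over {PAUSE_MS} ms "
        f"({share:.0%} of total)"
    )
```

On these samples server 1 has the lower median (78.125 ms vs 113 ms) but the higher total (10093.75 ms vs 3378 ms), and 6 of its 28 samples account for 8500 ms of that, roughly 84%. So the typical query really is faster on server 1; it's the handful of stalls that cost it the race, which is why I'd profile the stalls rather than compare the two boxes.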