I am looking for a CPU profiler for .NET that supports:
- heavily threaded apps.
- (dual) quad-core CPUs.
- sampling profiling.
- 64-bit OS.
- command-line API.
So far I have run into trouble with most .NET profilers. In particular:
- YourKit does not seem to support command-line use.
- dotTrace 3.1 crashes on a 64-bit OS.
I haven’t tried Intel VTune so far. Any enlightened suggestions?
Threading tends to confuse most profilers: they aren’t sure how to count the time a thread spends blocked waiting on another thread, or the time a thread sits idle waiting for work (if you aren’t using the thread pool).
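To make the ambiguity concrete, here is a minimal sketch (not taken from any particular profiler) that starts a worker which does nothing but wait, then compares wall-clock time with the CPU time the process actually consumed. The class and method names are made up for illustration, and the code assumes a reasonably recent C# compiler (tuples, `using var`).

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical demo class; the names are illustrative only.
public static class BlockedTimeDemo
{
    // Returns wall-clock and process CPU milliseconds for a worker that only waits.
    public static (double WallMs, double CpuMs) MeasureBlockedWorker(int waitMs)
    {
        var process = Process.GetCurrentProcess();
        using var gate = new ManualResetEventSlim(false);

        // The worker blocks until the gate is signalled; it does no real work.
        var worker = new Thread(() => gate.Wait());

        TimeSpan cpuBefore = process.TotalProcessorTime;
        var wall = Stopwatch.StartNew();

        worker.Start();
        Thread.Sleep(waitMs);   // the main thread is idle as well
        gate.Set();
        worker.Join();

        wall.Stop();
        process.Refresh();
        TimeSpan cpuAfter = process.TotalProcessorTime;

        return (wall.Elapsed.TotalMilliseconds,
                (cpuAfter - cpuBefore).TotalMilliseconds);
    }

    public static void Main()
    {
        var (wallMs, cpuMs) = MeasureBlockedWorker(200);
        Console.WriteLine($"wall-clock: {wallMs:F0} ms, CPU: {cpuMs:F0} ms");
    }
}
```

The worker is "alive" for the whole wait, yet the CPU figure should come out far smaller than the wall-clock figure. A profiler that attributes wall-clock time to the waiting method and one that attributes only CPU time will report very different hot spots for the same run.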
In our experience there are two profilers you should check out:
1. EQATEC has a free profiler that does code injection for profiling. It only works down to the method level, but it does work on 64-bit and can give you a sense of where your time is going.
2. The Red Gate ANTS Profiler is the best we’ve used for drilling into line-level timings, which is critical if you have a scenario where threads are blocked waiting on each other for a significant part of their execution time.
That said, if you can run your tests without a user interface, I highly recommend setting up some NUnit performance tests so you can quickly and easily repeat them exactly the same way each time. With those in place you can see where your time is going even with a method-level profiler, then propose and check individual optimizations. You’ll be surprised how often taking time out of one place just adds it to another because the threads end up waiting on each other, so having a perfectly repeatable test is very useful.
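As a rough idea of what such a test could look like, here is a sketch of an NUnit fixture that runs a fixed, threaded workload and checks both its result and its elapsed time. `RunWorkload`, the thread count, and the 5-second budget are placeholders invented for this example; you would swap in your own code and a budget measured on your own hardware. It assumes the NUnit package is referenced.

```csharp
using System.Diagnostics;
using System.Threading;
using NUnit.Framework;

[TestFixture]
public class ThreadedWorkloadPerformanceTests
{
    private const int ThreadCount = 8;          // fixed so every run is comparable
    private const int ItemsPerThread = 200_000; // fixed input size
    private const long BudgetMs = 5000;         // placeholder budget, tune per machine

    // Hypothetical stand-in for the real code under test.
    private static long RunWorkload()
    {
        long total = 0;
        var threads = new Thread[ThreadCount];
        for (int t = 0; t < ThreadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                long local = 0;
                for (int i = 0; i < ItemsPerThread; i++) local += i % 7;
                Interlocked.Add(ref total, local); // single shared update per thread
            });
        }
        foreach (var thread in threads) thread.Start();
        foreach (var thread in threads) thread.Join();
        return total;
    }

    [Test]
    public void WorkloadFinishesWithinBudget()
    {
        // Warm-up run so JIT compilation is not part of the timed run.
        long expected = RunWorkload();

        var stopwatch = Stopwatch.StartNew();
        long actual = RunWorkload();
        stopwatch.Stop();

        // Same input must give the same result, otherwise timings are not comparable.
        Assert.That(actual, Is.EqualTo(expected));
        Assert.That(stopwatch.ElapsedMilliseconds, Is.LessThan(BudgetMs));
    }
}
```

Because the test runs from the NUnit console runner, the same fixture can be executed under a profiler and re-run unchanged after each optimization attempt.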
The EQATEC profiler modifies your compiled IL, so every time you re-run the instrumented assembly you get a new performance trace file; that should fit your command-line scenario. I believe ANTS can work similarly.
Finally, if your application is going to be used in multiple environments, I’d definitely recommend doing a few performance test runs on a single-processor machine or one whose only parallelism is hyper-threading. When you get very aggressive about tuning for multiple cores, you can end up adding enough overhead that the application may deadlock or simply be slow on a small number of processors.
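If you don’t have a single-processor box handy, one way to approximate it is to restrict the process to one logical processor for the duration of a run. The sketch below does that with `Process.ProcessorAffinity`; the `SingleCoreRun` helper is made up for this example, and setting affinity is assumed to be supported on your platform (it is on Windows and Linux, not on macOS). It is only an approximation of real single-CPU hardware.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper; the names are illustrative only.
public static class SingleCoreRun
{
    // Runs the workload with the whole process pinned to the first logical processor.
    public static TimeSpan TimeOnFirstCore(Action workload)
    {
        var process = Process.GetCurrentProcess();
        IntPtr original = process.ProcessorAffinity;
        try
        {
            process.ProcessorAffinity = (IntPtr)1; // bit 0 = first logical processor
            var stopwatch = Stopwatch.StartNew();
            workload();
            stopwatch.Stop();
            return stopwatch.Elapsed;
        }
        finally
        {
            process.ProcessorAffinity = original;  // restore for the rest of the run
        }
    }

    public static void Main()
    {
        TimeSpan elapsed = TimeOnFirstCore(() =>
        {
            // Placeholder workload: four threads that each burn some CPU.
            var threads = new Thread[4];
            for (int t = 0; t < threads.Length; t++)
            {
                threads[t] = new Thread(() =>
                {
                    long sum = 0;
                    for (int i = 0; i < 5_000_000; i++) sum += i;
                    GC.KeepAlive(sum);
                });
                threads[t].Start();
            }
            foreach (var thread in threads) thread.Join();
        });
        Console.WriteLine($"pinned to one core: {elapsed.TotalMilliseconds:F0} ms");
    }
}
```

Comparing this timing with an unpinned run of the same workload gives a quick indication of how much the multi-core tuning costs when the cores aren’t there.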