I trying to get started with Google Perf Tools to profile some CPU intensive applications. It’s a statistical calculation that dumps each step to a file using `ofstream’. I’m not a C++ expert so I’m having troubling finding the bottleneck. My first pass gives results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The “_write$UNIX2003” (where can I find out what this is?) appears to be coming from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and `_write$UNIX2003′ goes away. However my application does not speed up, as measured by total time. The whole thing speeds up less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I’d expect, except I see no visible increase in performance (.1 second on a 10 second calculation). The code is essentially:
ofstream outfile("out.txt");
for loop:
SliceStep::DoStep()
outfile << 'result'
outfile.close()
Update: I timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say, that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command. This should get you a rough idea where your time is spend.
If the results are still not conclusive you need a better testcase:
That means the profiler is wrong.
Printing 100000 lines to the console using python results in something like:
To console:
Versus:
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who’s wrong?