I witnessed the following weird behavior. I have two functions that do almost the same thing: they measure the number of cycles it takes to do a certain operation. In one function, I increment a variable inside the loop; in the other, the loop body is empty. The variables are volatile so they won't be optimized away. These are the functions:
    unsigned int _osm_iterations=5000;

    double osm_operation_time(){
        // volatile is used so that j will not be optimized, and ++ operation
        // will be done in each loop
        volatile unsigned int j=0;
        volatile unsigned int i;
        tsc_counter_t start_t, end_t;
        start_t = tsc_readCycles_C();
        for (i=0; i<_osm_iterations; i++){
            ++j;
        }
        end_t = tsc_readCycles_C();
        if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
            return -1;
        return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
    }

    double osm_empty_time(){
        volatile unsigned int i;
        volatile unsigned int j=0;
        tsc_counter_t start_t, end_t;
        start_t = tsc_readCycles_C();
        for (i=0; i<_osm_iterations; i++){
            ;
        }
        end_t = tsc_readCycles_C();
        if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
            return -1;
        return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
    }
There are some non-standard functions there but I’m sure you’ll manage.
The thing is, the first function returns 4, while the second function, which obviously does less, returns 6.
Does that make any sense to anyone?
Actually I made the first function so I could reduce the loop overhead for my measurement of the second. Do you have any idea how to do that (as this method doesn’t really cut it)?
I’m on Ubuntu (64 bit I think).
Thanks a lot.
I can see a couple of things here. One is that the code for the two loops looks identical. Secondly, the compiler will probably realise that the variable i and the variable j will always have the same value and optimise one of them away. You should look at the generated assembly and see what is really going on.
Another theory is that the change to the inner body of the loop has affected the cacheability of the code: it could have moved it across cache lines, or something similar.
Since the code is so trivial, you may find it difficult to get an accurate timing value. Even with 5000 iterations, the measured time may fall inside the margin of error of the timing code you are using. A modern computer can probably run that in far less than a millisecond, so perhaps you should increase the number of iterations?
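As a rough sketch of that suggestion, the loop could be timed many times over a larger iteration count, keeping the fastest run. This is only an illustration under some assumptions: it uses the POSIX clock_gettime call as a stand-in for the tsc_* helpers from the question (which are not shown), so it reports nanoseconds rather than cycles, and the names now_ns, time_loop_ns and best_ns_per_iteration are made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std= modes */
#endif

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds. Assumption: this replaces the
 * tsc_readCycles_C()/tsc_C2CI() helpers from the question. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time a single run of the loop. with_increment selects the ++j body;
 * otherwise the loop body is empty. Both counters are volatile, as in
 * the question, so the compiler has to keep the loop. */
static uint64_t time_loop_ns(unsigned int iterations, int with_increment)
{
    volatile unsigned int i;
    volatile unsigned int j = 0;
    uint64_t start = now_ns();

    if (with_increment) {
        for (i = 0; i < iterations; i++) { ++j; }
    } else {
        for (i = 0; i < iterations; i++) { }
    }
    return now_ns() - start;
}

/* Repeat the measurement and keep the minimum, which is the run least
 * disturbed by interrupts, scheduling and cold caches. The result is in
 * nanoseconds per iteration, computed in floating point. */
static double best_ns_per_iteration(unsigned int iterations, int repeats,
                                    int with_increment)
{
    assert(iterations > 0 && repeats > 0);

    uint64_t best = UINT64_MAX;
    for (int r = 0; r < repeats; r++) {
        uint64_t t = time_loop_ns(iterations, with_increment);
        if (t < best) {
            best = t;
        }
    }
    return (double)best / (double)iterations;
}
```

Calling best_ns_per_iteration(1000000, 10, 1) and best_ns_per_iteration(1000000, 10, 0) and subtracting the second from the first gives an estimate of the cost of the increment alone. Note also that the division here is done in floating point; if tsc_C2CI in the question returns an integer type, the original division truncates before the result is converted to double, which could hide small differences.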
To see the generated assembly in gcc, specify the -S compiler option: