On my machine Time A and Time B swap depending on whether A is
defined or not (which changes the order in which the two callocs are called).
I initially attributed this to the paging system. Weirdly, when
mmap is used instead of calloc, the situation is even more bizarre: both loops take the same amount of time, as expected. As
can be seen with strace, the callocs ultimately result in two
mmaps, so there is no return-already-allocated-memory magic going on.
I’m running Debian testing on an Intel i7.
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
    clock_t start, finish;
#ifdef A
    int *arr1 = ALLOC(sizeof(int), SIZE);
    int *arr2 = ALLOC(sizeof(int), SIZE);
#else
    int *arr2 = ALLOC(sizeof(int), SIZE);
    int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
    int i;

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr1[i] = (i + 13) * 5;
    }
    finish = clock();
    printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr2[i] = (i + 13) * 5;
    }
    finish = clock();
    printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

    return 0;
}
The output I get:
~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.94
Time B: 0.34
~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.34
Time B: 0.90
~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.89
Time B: 0.90
~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.91
Time B: 0.92
Short Answer
The first time calloc is called, it explicitly zeroes out the memory. The next time it is called, it assumes that the memory returned from mmap is already zeroed out.
Details
Here are some of the things I checked to come to this conclusion; you could try them yourself:
Insert a calloc call before your first ALLOC call. You will see that after this, Time A and Time B are the same.
Use the clock() function to check how long each of the ALLOC calls takes. In the case where both use calloc, you will see that the first call takes much longer than the second one.
Use time to time the execution of the calloc version and the USE_MMAP version. When I did this, the execution time for USE_MMAP was consistently slightly less.
I ran with strace -tt -T, which shows both when each system call was made and how long it took. Here is part of the output:
Strace output:
You can see that the first mmap call took 0.000014 seconds, but that about 1.5 seconds elapsed before the next system call. Then the second mmap call took 0.000021 seconds, and was followed by the times call a few hundred microseconds later.
I also stepped through part of the application’s execution with gdb and saw that the first call to calloc resulted in numerous calls to memset, while the second call to calloc did not make any calls to memset. You can see the source code for calloc here (look for __libc_calloc) if you are interested.
As for why calloc does the memset on the first call but not on subsequent ones, I don’t know. But I feel fairly confident that this explains the behavior you asked about.
As for why the array that was zeroed with memset has improved performance, my guess is that it is because of values being loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference you asked about is that the two calloc calls behave differently when they are executed.
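The clock() check described above (timing each allocation call separately) could be sketched roughly as follows. This is only an illustration, not the code I actually ran: the helper names time_calloc and compare_calloc_calls are made up for this example, and whether the first call really comes out slower will depend on your libc version and on the state of the allocator.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not from the original post): returns the CPU time,
 * in seconds, spent inside a single calloc(count, size) call.
 * The resulting pointer is handed back through *out (NULL on failure). */
static double time_calloc(size_t count, size_t size, void **out)
{
    clock_t start = clock();
    *out = calloc(count, size);
    clock_t finish = clock();

    /* Read one byte through a volatile pointer so that an optimizing
     * compiler cannot discard the allocation as unused. */
    if (*out != NULL)
        (void)*(volatile unsigned char *)*out;

    return (double)(finish - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: times two back-to-back calloc calls of the same
 * size and prints both durations. Returns 0 on success, or -1 if either
 * allocation failed. Both buffers are freed before returning. */
static int compare_calloc_calls(size_t count, size_t size)
{
    void *first = NULL;
    void *second = NULL;
    double t1 = time_calloc(count, size, &first);
    double t2 = time_calloc(count, size, &second);
    int ok = (first != NULL && second != NULL);

    if (ok) {
        printf("calloc #1: %.6f s\n", t1);
        printf("calloc #2: %.6f s\n", t2);
    }
    free(first);
    free(second);
    return ok ? 0 : -1;
}
```

Calling compare_calloc_calls(SIZE, sizeof(int)) from main, with SIZE as defined in the question, would reproduce the check at the original allocation size. Note that this needs roughly 4 GB of address space, so a smaller count may be more practical for a first try.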