My application takes a checkpoint every few hundred milliseconds by using the fork system call. However, I notice that my application slows down significantly when using checkpointing (forking). I measured the time taken by the fork call itself and it came out to be 1 to 2 ms. So why is fork slowing down my application so much? Note that I only keep one checkpoint (forked process) at a time and kill the previous checkpoint whenever I take a new one. Also, my computer has plenty of RAM.
Note that my forked process just sleeps after creation. It is only woken up when a rollback needs to be done, so it should not be scheduled by the OS. One thing that comes to mind is that, since fork uses a copy-on-write mechanism, page faults occur whenever my application modifies a page. But should that slow down the application significantly? Without checkpointing (forking), my application finishes in approximately 3.1 seconds, and with it, it takes around 3.7 seconds. Any idea what is slowing down my application?
You are probably observing the cost of the copy-on-write mechanism, as you hypothesize. That’s actually quite expensive: it is the reason `vfork` still exists. (The main cost is not the extra page faults themselves, but the `memcpy` of each page as it is touched, and the associated cache and TLB flushes.) It’s not showing up as a cost of `fork` because the page faults don’t happen inside the system call.
You can confirm the hypothesis by looking at the times reported by `getrusage`: if this is correct, the extra time elapsed should be nearly all “system” time (CPU burnt inside the kernel). `oprofile` or `perf` will let you pin down the problem more specifically… if you can get them to work at all, which is nontrivial, alas.
Unfortunately, copy-on-write is also the reason why your checkpoint mechanism works in the first place. Can you get away with taking checkpoints at longer intervals? That’s the only quick fix I can think of.
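To make the getrusage check concrete, here is a minimal sketch in C, not a definitive benchmark. It assumes a POSIX system, and the names `system_seconds`, `touch_pages` and `measure` are made up for illustration. It dirties every page of a buffer several times, optionally taking a fork "checkpoint" before each round, and reports how much system CPU time the parent consumed. Note that the figure also includes the fork, kill and waitpid calls themselves, so it is an upper bound on the copy-on-write cost.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* CPU seconds this process has spent inside the kernel so far. */
static double system_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
}

/* Write one byte in every page so that each page is dirtied once. */
static void touch_pages(unsigned char *buf, size_t len, size_t page)
{
    for (size_t off = 0; off < len; off += page)
        buf[off]++;
}

/* Kill and reap a checkpoint process, if there is one. */
static void discard(pid_t pid)
{
    if (pid > 0) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
}

/*
 * Dirty every page of buf `rounds` times. If use_fork is nonzero, take a
 * fork() checkpoint before each round and discard the previous one, so at
 * most one checkpoint exists at a time (as in the question).
 * Returns the system CPU seconds used by the parent, or -1.0 on error.
 */
static double measure(unsigned char *buf, size_t len, int rounds, int use_fork)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    pid_t prev = -1;
    double start = system_seconds();

    for (int i = 0; i < rounds; i++) {
        if (use_fork) {
            pid_t pid = fork();
            if (pid < 0) {
                discard(prev);
                return -1.0;
            }
            if (pid == 0) {
                for (;;)
                    pause();        /* checkpoint: sleep until killed */
            }
            discard(prev);
            prev = pid;
        }
        /* With a live checkpoint, each first write to a page triggers a
         * copy-on-write fault and a page copy inside the kernel. */
        touch_pages(buf, len, page);
    }
    discard(prev);
    return system_seconds() - start;
}
```

To use it, allocate a buffer of a size comparable to your application's working set, write to all of it once (for example with `memset`) so the pages are really backed by memory, then call `measure` once with `use_fork` set to 0 and once with it set to 1. If the hypothesis holds, the second figure should be noticeably larger, and the difference should roughly track the number of pages dirtied between checkpoints.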