I have a bug in a multi-process program. The program receives input and instantly produces output, with no network involved, and it doesn’t have any time references.
What makes the cause of this bug hard to track down is that it only happens sometimes.
If I run it repeatedly, it produces both correct and incorrect output, with no discernible order or pattern.
What can cause such non-deterministic behavior? Are there tools out there that can help? There is a possibility that there are uninitialized variables in play. How do I find those?
EDIT: Problem solved, thanks to everyone who suggested a race condition.
I didn’t think of it, mainly because I was sure that my design prevented it. The problem was that I used ‘wait’ instead of ‘waitpid’, so sometimes, when some other process was lucky enough to finish before the one I was expecting, the intended order of events broke down.
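For anyone hitting the same thing, here is a minimal sketch of the difference, written in Python because its `os.wait` and `os.waitpid` wrap the POSIX calls of the same names. It is not my actual program: the helper `spawn`, the two functions, the delays, and the exit codes are all made up for illustration, and it assumes a POSIX system where `os.fork` is available.

```python
import os
import time

SLOW_EXIT = 1  # hypothetical exit code of the child the parent depends on
FAST_EXIT = 2  # hypothetical exit code of an unrelated, quicker child


def spawn(delay, exit_code):
    """Fork a child that sleeps `delay` seconds, then exits with `exit_code`."""
    pid = os.fork()
    if pid == 0:
        # Child: pretend to work, then exit immediately without running
        # any of the parent's cleanup code.
        time.sleep(delay)
        os._exit(exit_code)
    return pid  # parent: remember the child's PID


def reap_with_wait():
    """Pattern from the bug: wait() returns whichever child exits first."""
    slow = spawn(0.2, SLOW_EXIT)
    fast = spawn(0.0, FAST_EXIT)
    pid, status = os.wait()
    # Reap the remaining child so no zombie is left behind.
    os.waitpid(slow if pid == fast else fast, 0)
    return os.WEXITSTATUS(status)


def reap_with_waitpid():
    """Fix: waitpid() blocks until the one child we name has exited."""
    slow = spawn(0.2, SLOW_EXIT)
    fast = spawn(0.0, FAST_EXIT)
    _, status = os.waitpid(slow, 0)
    os.waitpid(fast, 0)
    return os.WEXITSTATUS(status)
```

`reap_with_waitpid()` always reports the slow child's exit code. `reap_with_wait()` will usually report the quick child's code here, but which child it sees first is up to the scheduler, which is exactly why the bug only showed up sometimes.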
The scheduler!
Basically, when you have multiple processes, they can run in any bizarre order they want. If those processes share a resource that they both read from and write to (whether it be a file, memory, or an IO device of some sort), their operations are going to get interleaved in all sorts of weird orders. As a simple example, suppose you have two threads, P1 and P2 (they’re threads so they share memory), and they’re both trying to increment a global variable, x.
Now run those two threads, but interleave the code in this way.
Assume x = 1.
P1: y = x + 1
So now in P1, for variable y, which is local and on the stack, y = 2. Then the scheduler comes in and starts P2.
P2: x = x + 1
x was still 1 coming into this, so 1 has been added to it and now x = 2. Then P1 finishes.
P1: x = y
and x is still 2! We incremented x twice but only got the effect of one increment. And because we don’t know when, or whether, this is going to happen, it’s referred to as non-deterministic behavior.
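The interleaving described above can be replayed on real threads. The sketch below is only an illustration under some assumptions: it uses Python's `threading` module, it forces the unlucky schedule with `threading.Event` objects instead of waiting for the scheduler to produce it by chance, and the function names `lost_update` and `locked_update` are made up. The second function shows one common remedy, a lock around the read-modify-write, which is a technique added here rather than something taken from the text above.

```python
import threading


def lost_update():
    """Force the bad interleaving with events and return the final x."""
    shared = {"x": 1}
    p1_has_read = threading.Event()
    p2_has_written = threading.Event()

    def p1():
        y = shared["x"] + 1            # reads x == 1, so local y == 2
        p1_has_read.set()              # stand-in for the scheduler preempting P1
        p2_has_written.wait()          # P1 stays "descheduled" until P2 is done
        shared["x"] = y                # writes back the stale result: x == 2

    def p2():
        p1_has_read.wait()             # run only after P1 has read x
        shared["x"] = shared["x"] + 1  # x was still 1, so now x == 2
        p2_has_written.set()

    threads = [threading.Thread(target=p1), threading.Thread(target=p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["x"]


def locked_update():
    """Same two increments, but each read-modify-write holds a lock."""
    shared = {"x": 1}
    lock = threading.Lock()

    def increment():
        with lock:                     # no other thread can run between read and write
            y = shared["x"] + 1
            shared["x"] = y

    threads = [threading.Thread(target=increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["x"]
```

`lost_update()` ends with x at 2 even though two increments ran, while `locked_update()` ends at 3 regardless of which thread the scheduler picks first.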
The good news is, you’ve stumbled upon one of the hardest problems in systems programming, as well as the primary battle cry of many of the functional language folks.