I’m implementing pthread condition variables (based on Linux futexes) and I have an idea for avoiding the “stampede effect” on pthread_cond_broadcast with process-shared condition variables. For non-process-shared cond vars, futex requeue operations are traditionally (i.e. by NPTL) used to requeue waiters from the cond var’s futex to the mutex’s futex without waking them up, but this is in general impossible for process-shared cond vars, because pthread_cond_broadcast might not have a valid pointer to the associated mutex. In the worst case, the mutex might not even be mapped in the broadcasting process’s address space.
My idea for overcoming this issue is to have pthread_cond_broadcast only directly wake one waiter, and have that waiter perform the requeue operation when it wakes up, since it does have the needed pointer to the mutex.
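The two halves of this idea can be sketched as follows. This is only a minimal illustration, not a complete condition variable: the names `futex`, `cond_broadcast_sketch`, `requeue_rest_sketch`, `cond_seq` and `mutex_word` are hypothetical, the sequence-counter scheme is an assumption, and none of the race handling discussed below is attempted. It assumes Linux with `SYS_futex` available, and it omits `FUTEX_PRIVATE_FLAG` because the futexes are meant to be process-shared.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call (glibc exports no futex()). */
static long futex(uint32_t *uaddr, int op, uint32_t val,
                  unsigned long val2, uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, op, val, val2, uaddr2, val3);
}

/* Broadcaster side: it may not have the mutex mapped, so it only
 * bumps the (hypothetical) sequence word and wakes ONE waiter.
 * Returns the number of waiters woken (0 or 1), or -errno. */
static long cond_broadcast_sketch(uint32_t *cond_seq)
{
    __atomic_add_fetch(cond_seq, 1, __ATOMIC_SEQ_CST);
    long r = futex(cond_seq, FUTEX_WAKE, 1, 0, NULL, 0);
    return r < 0 ? -errno : r;
}

/* Woken waiter side: this thread does have a pointer to the mutex,
 * so it moves everyone still queued on the cond var over to the
 * mutex futex without waking them. FUTEX_CMP_REQUEUE fails with
 * EAGAIN if *cond_seq no longer equals expected_seq.
 * Returns the number of waiters requeued, or -errno. */
static long requeue_rest_sketch(uint32_t *cond_seq, uint32_t *mutex_word,
                                uint32_t expected_seq)
{
    assert(cond_seq != NULL && mutex_word != NULL);
    long r = futex(cond_seq, FUTEX_CMP_REQUEUE,
                   0,        /* wake nobody            */
                   INT_MAX,  /* requeue all the others */
                   mutex_word, expected_seq);
    return r < 0 ? -errno : r;
}
```

With no waiters queued, both calls simply return 0; the `expected_seq` comparison is what lets the requeueing waiter notice that another broadcast slipped in between its wakeup and the requeue.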
Naturally there are a lot of ugly race conditions to consider if I pursue this approach, but if they can be overcome, are there any other reasons such an implementation would be invalid or undesirable? One potential issue that might be insurmountable is the race in which the waiter (a separate process) responsible for the requeue gets killed before it can act, but it might be possible to overcome even this by putting the condvar futex in the robust mutex list so that the kernel performs a wake on it when the process dies.
There may be waiters belonging to multiple address spaces, each of which has mapped the mutex associated with the futex at a different address in memory. I’m not sure if FUTEX_REQUEUE is safe to use when the requeue point may not be mapped at the same address in all waiters; if it is, then this isn’t a problem.
There are other problems that won’t be detected by robust futexes; for example, if your chosen waiter is busy in a signal handler, you could be kept waiting an arbitrarily long time. [As discussed in the comments, these are not an issue]
Note that with robust futexes, you must set the value of the futex & 0x3FFFFFFF to be the TID of the thread to be woken up; you must also set bit FUTEX_WAITERS on if you want a wakeup. This means that you must choose which thread to awaken from the broadcasting thread, or you will be unable to deal with thread death immediately after the FUTEX_WAKE. You’ll also need to deal with the possibility of the thread dying immediately before the waker thread writes its TID into the state variable – perhaps having a ‘pending master’ field that is also registered in the robust mutex system would be a good idea.
I see no reason why this can’t work, then, as long as you make sure to deal with the thread exit issues carefully. That said, it may be best to simply define in the kernel an extension to FUTEX_WAIT that takes a requeue point and comparison value as an argument, and let the kernel handle this in a simple, race-free manner.
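The robust-futex word layout mentioned above can be made concrete with the constants from `<linux/futex.h>`. The helper names here (`make_owner_word`, `owner_tid`, `owner_died`) are hypothetical, and this sketch only shows how the word is composed and decoded; it does not register anything on the robust list or handle the "pending master" race.

```c
#include <assert.h>
#include <linux/futex.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the value the broadcaster would store in the futex word:
 * the low 30 bits (FUTEX_TID_MASK, 0x3fffffff) hold the TID of the
 * waiter chosen to do the requeue, and FUTEX_WAITERS asks the kernel
 * to issue a wakeup if that thread dies. */
static uint32_t make_owner_word(uint32_t tid, bool want_wakeup)
{
    assert((tid & ~(uint32_t)FUTEX_TID_MASK) == 0); /* TID must fit in 30 bits */
    uint32_t word = tid & FUTEX_TID_MASK;
    if (want_wakeup)
        word |= FUTEX_WAITERS;
    return word;
}

/* Extract the TID of the designated thread from the futex word. */
static uint32_t owner_tid(uint32_t word)
{
    return word & FUTEX_TID_MASK;
}

/* The kernel sets FUTEX_OWNER_DIED in the word when the thread whose
 * TID is stored there exits with the futex still on its robust list. */
static bool owner_died(uint32_t word)
{
    return (word & FUTEX_OWNER_DIED) != 0;
}
```

A waiter that wakes up and sees `owner_died()` return true would know that the thread chosen to perform the requeue is gone and that it has to take over that job itself.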