I hit a bug in my code, which uses WSARecv and WSAGetOverlappedResult on an overlapped socket. Under heavy load, WSAGetOverlappedResult fails with WSASYSCALLFAILURE (‘A system call that should never fail has failed’), and my TCP stream is out of sync afterwards, causing mayhem in the upper levels of my program.
So far I have not been able to isolate it to a given set of hardware or drivers. Has anybody else hit this issue and found a solution or workaround?
It seems I have solved this issue by sleeping for 1 ms and retrying WSAGetOverlappedResult whenever it reports WSASYSCALLFAILURE.
I also had another issue, which I had to solve first: overlapped events firing even though there is no data. The test has now been running for over an hour, with a few WSASYSCALLFAILURE errors handled correctly. Hopefully the overnight test will succeed as well.
@Len: thanks again for your help.
EDIT: The overnight test was successful. My bug was caused by two interdependent issues:
Issue 1: WaitForMultipleObjects in ConnectionSet::select occasionally
signals data on an empty socket, causing SocketConnection::readSync to
deadlock.
Fix: Do a non-blocking read on the first byte of each packet. Reset
the ConnectionSet if the socket was empty.
Issue 2: WSAGetOverlappedResult occasionally fails with WSASYSCALLFAILURE,
causing the TCP stream to go out of sync.
Fix: Retry WSAGetOverlappedResult after a small sleep period.
http://equalizer.svn.sourceforge.net/viewvc/equalizer?view=revision&revision=4649
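
For readers who only want the idea behind the fix for Issue 2, here is a minimal sketch of the retry loop. It is not the code from the revision linked above. The helper name retryOnSysCallFailure, the attempt limit, and the convention that the callback returns 0 on success or a Winsock error code are all assumptions made for illustration. The constant 10107 is the numeric value of WSASYSCALLFAILURE, repeated here only so the sketch compiles without the Winsock headers.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Numeric value of WSASYSCALLFAILURE; redefined so the sketch builds off Windows.
constexpr int kWsaSysCallFailure = 10107;

// Hypothetical helper: calls `poll` until it reports anything other than
// WSASYSCALLFAILURE, or until `maxAttempts` calls have been made.
// `poll` is assumed to return 0 on success or a Winsock error code on failure.
// Returns the result of the last call (or WSASYSCALLFAILURE if maxAttempts <= 0).
int retryOnSysCallFailure(const std::function<int()>& poll,
                          int maxAttempts = 10,
                          std::chrono::milliseconds delay = std::chrono::milliseconds(1))
{
    int error = kWsaSysCallFailure;
    for (int attempt = 0;
         attempt < maxAttempts && error == kWsaSysCallFailure;
         ++attempt)
    {
        if (attempt > 0)
            std::this_thread::sleep_for(delay); // short pause before retrying
        error = poll();
    }
    return error;
}
```

On Windows, the callback would presumably wrap the real call, returning 0 when WSAGetOverlappedResult succeeds and WSAGetLastError() otherwise. Capping the number of attempts is a design choice of this sketch, so that a persistent failure surfaces to the caller instead of spinning forever.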