I have a really weird problem that is driving me crazy.
I have a Ruby server and a Flash client (ActionScript 3). It’s a multiplayer game.
The problem is that everything works perfectly and then, suddenly, a random player stops receiving data. When the server then closes the connection because of inactivity, about 20-60 seconds later, the client receives all the buffered data.
The client uses XMLSocket to receive data, so I don’t think the way the client receives data is the problem.
socket.addEventListener(Event.CONNECT, connectHandler);
function connectHandler(event)
{
    sendData(sess);
}
function sendData(dat)
{
    trace("SEND: " + dat);
    addDebugData("SEND: " + dat);
    if (socket.connected) {
        socket.send(dat);
    } else {
        addDebugData("SOCKET NOT CONNECTED");
    }
}
socket.addEventListener(DataEvent.DATA, dataHandler);
function dataHandler(e:DataEvent) {
    var data:String = e.data;
    workData(data);
}
The server flushes data after every write, so it is not a flushing problem:
sock.write(data + DATAEOF)
sock.flush()
DATAEOF is the null character, which XMLSocket uses as the message terminator, so the client knows where each message ends and can parse the string.
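To make the framing concrete, here is a minimal Ruby sketch of how a receiver could split a byte stream on the null terminator. The helper name `extract_messages` and the sample strings are hypothetical and not part of my server; the only thing taken from above is that DATAEOF is the null character.

```ruby
DATAEOF = "\0" # null character, as described above

# Append a chunk read from the socket to `buffer` and return every complete
# message it now contains. A trailing partial message stays in `buffer`.
def extract_messages(buffer, chunk)
  buffer << chunk
  messages = []
  while (idx = buffer.index(DATAEOF))
    # Remove the message plus its terminator from the buffer, then drop the terminator.
    messages << buffer.slice!(0..idx).chomp(DATAEOF)
  end
  messages
end

buffer = +""
first  = extract_messages(buffer, "hello\0wor") # one full message, one partial
second = extract_messages(buffer, "ld\0")       # completes the partial message
```

The point of keeping a buffer is that TCP delivers a stream, not messages, so a single read may contain half a message or several of them.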
When the server accepts a new socket, it sets sync to true, to autoflush, and TCP_NODELAY to true too:
newsock = serverSocket.accept
newsock.sync = true
newsock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, true)
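One way to double-check that these options really took effect is to read TCP_NODELAY back from the kernel with getsockopt. The following is only a sketch on a loopback connection; the helper names `accept_with_low_latency` and `nodelay_enabled?` are made up for illustration.

```ruby
require "socket"

# Accept one connection and apply the same options as above.
def accept_with_low_latency(server_socket)
  sock = server_socket.accept
  sock.sync = true
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, true)
  sock
end

# True if the kernel reports TCP_NODELAY as enabled on this socket.
def nodelay_enabled?(sock)
  sock.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY).int != 0
end

server = TCPServer.new("127.0.0.1", 0)              # port 0: the OS picks a free port
client = TCPSocket.new("127.0.0.1", server.addr[1]) # addr[1] is the chosen port
conn   = accept_with_low_latency(server)

nodelay = nodelay_enabled?(conn)
autoflush = conn.sync
puts "TCP_NODELAY enabled: #{nodelay}, sync: #{autoflush}"

[conn, client, server].each(&:close)
```

If this reports both options as enabled, the stall is unlikely to come from Nagle's algorithm or from Ruby-level buffering.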
This is my research:
Info: I was dumping netstat data to a file each second.
- When the client stops receiving data, netstat shows that the socket status is still ESTABLISHED.
- Some seconds after that, the send queue grows according to the data sent.
- tcpflow shows that packets are sent twice.
- When the server closes the socket, the socket status changes to FIN_WAIT1, as expected. Then tcpflow shows that all buffered data is sent to the client, but the client doesn’t receive it. Some seconds after that, the connection disappears from netstat and tcpflow shows that the same data is sent again; this time the client receives it, so it starts sending data to the server, and the server receives it. But it’s too late… the server has already closed the connection.
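For anyone who wants to repeat the netstat part of this research, here is a rough Ruby sketch that picks out connections with a non-empty send queue from a netstat dump. It assumes the Linux `netstat -tn` column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function names and the sample lines are invented for illustration and are not my real data.

```ruby
# Parse one TCP line of Linux `netstat -tn` output.
# Returns nil for headers, blank lines, and anything that is not a TCP row.
def parse_netstat_line(line)
  proto, recv_q, send_q, local, remote, state = line.split
  return nil unless proto&.start_with?("tcp") && state
  { local: local, remote: remote,
    recv_q: recv_q.to_i, send_q: send_q.to_i, state: state }
end

# Connections whose send queue holds more than `threshold` bytes.
def stalled_connections(lines, threshold = 0)
  lines.map { |l| parse_netstat_line(l) }
       .compact
       .select { |c| c[:send_q] > threshold }
end

# Invented sample in the assumed format (documentation-only IP addresses).
sample = [
  "Proto Recv-Q Send-Q Local Address           Foreign Address         State",
  "tcp        0      0 192.0.2.10:8080         203.0.113.5:51234       ESTABLISHED",
  "tcp        0   5792 192.0.2.10:8080         203.0.113.9:40022       ESTABLISHED"
]

stalled = stalled_connections(sample)
stalled.each { |c| puts "#{c[:remote]} #{c[:state]} send_q=#{c[:send_q]}" }
```

A connection that stays ESTABLISHED while its send queue keeps growing matches the symptom described in the list above.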
I don’t think it’s an OS/network problem, because I moved from a VPS located in Spain to Amazon EC2 located in Ireland and the problem remains.
I don’t think it’s a client network problem either, because this occurs dozens of times per day, and the average number of online users is about 45-55, with about 400 unique users a day, so the ratio is extremely high.
EDIT:
I’ve done more research. I’ve rewritten the server in C++.
When a client stops sending data, after a while the server receives a “Connection reset by peer” error. At that moment, tcpdump shows that the client sent a RST packet. This could be because the client closed the connection and the server tried to read, but… why did the client close the connection? I think the answer is that the client is not the one closing the connection; the kernel is. Here is some info: http://scie.nti.st/2008/3/14/amazon-s3-and-connection-reset-by-peer
Basically, as I understand it, Linux kernels 2.6.17+ increased the maximum size of the TCP window/buffer, and this started to cause other gear to wig out, if it couldn’t handle sufficiently large TCP windows. The gear would reset the connection, and we see this as a “Connection reset by peer” message.
I’ve followed the steps, and now it seems that the server closes connections only when the client loses its connection to the internet.
I’m going to add this as an answer so people know a bit more about this.
I think the answer is that the kernel is the one closing the connection. Here is some info: http://scie.nti.st/2008/3/14/amazon-s3-and-connection-reset-by-peer
Basically, as I understand it, Linux kernels 2.6.17+ increased the maximum size of the TCP window/buffer, and this started to cause other gear to wig out, if it couldn’t handle sufficiently large TCP windows. The gear would reset the connection, and we see this as a “Connection reset by peer” message.
I’ve followed the steps, and now it seems that the server closes connections only when the client loses its connection to the internet.
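The steps themselves are in the linked article and are not repeated here. As a rough aid for checking what a server currently uses, the sketch below reads the kernel's TCP buffer limits, assuming the relevant settings are the Linux files /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem (each holds three byte counts: min, default, max). Whether these are exactly the values the article changes is an assumption, and the function names are hypothetical.

```ruby
# Parse the "min default max" format (in bytes) used by tcp_rmem and tcp_wmem.
def parse_tcp_mem(text)
  min, default, max = text.split.map(&:to_i)
  { min: min, default: default, max: max }
end

# Read both limits from `dir`; files that are missing or unreadable are skipped,
# so this returns an empty hash on non-Linux systems.
def read_tcp_limits(dir = "/proc/sys/net/ipv4")
  %w[tcp_rmem tcp_wmem].each_with_object({}) do |name, out|
    path = File.join(dir, name)
    out[name] = parse_tcp_mem(File.read(path)) if File.readable?(path)
  end
end

read_tcp_limits.each do |name, limits|
  puts "#{name}: max #{limits[:max]} bytes"
end
```

This only reads the values; changing them requires root and should follow the article rather than this sketch.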