I have a Linux C++ application which receives sequenced UDP packets. Because of the sequencing, I can easily determine when a packet is lost or re-ordered, i.e. when a “gap” is encountered. The system has a recovery mechanism to handle gaps, but it is best to avoid gaps in the first place. Using a simple libpcap-based packet sniffer, I have determined that there are no gaps in the data at the hardware level. However, I am seeing a lot of gaps in my application. This suggests the kernel is dropping packets, which is confirmed by the /proc/net/snmp file: whenever my application encounters a gap, the Udp InErrors counter increases.
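For anyone wanting to correlate gaps with the kernel counter programmatically, here is a minimal sketch that reads the Udp counters from /proc/net/snmp. It assumes the usual layout of that file (a "Udp:" line of column names followed by a "Udp:" line of values). The function names parse_udp_counters and read_udp_counters are hypothetical, not part of the application described above.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parse the pair of "Udp:" lines (column names, then values) from a stream
// in /proc/net/snmp format and return a name -> value map.
// Hypothetical helper; assumes the header line precedes the value line.
std::map<std::string, unsigned long long> parse_udp_counters(std::istream& in) {
    std::map<std::string, unsigned long long> counters;
    std::vector<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        // Skip every line that does not start with the exact "Udp:" prefix.
        if (line.compare(0, 4, "Udp:") != 0) continue;
        std::istringstream fields(line.substr(4));
        if (names.empty()) {
            // First "Udp:" line: the column names.
            std::string name;
            while (fields >> name) names.push_back(name);
        } else {
            // Second "Udp:" line: the values, in the same column order.
            unsigned long long value = 0;
            for (std::size_t i = 0; i < names.size() && (fields >> value); ++i)
                counters[names[i]] = value;
            break;
        }
    }
    return counters;
}

// Convenience wrapper that reads the live counters from the kernel.
std::map<std::string, unsigned long long> read_udp_counters() {
    std::ifstream snmp("/proc/net/snmp");
    return parse_udp_counters(snmp);
}
```

Calling read_udp_counters() once at startup and again whenever a gap is detected, then comparing the InErrors values, shows whether the gap coincides with a kernel-side drop. Looking columns up by name rather than by position matters because the set of Udp columns differs between kernel versions.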
At the system level, we have increased the max receive buffer:
# sysctl net.core.rmem_max
net.core.rmem_max = 33554432
At the application level, we have increased the receive buffer size:
int sockbufsize = 33554432;
int ret = setsockopt(my_socket_fd, SOL_SOCKET, SO_RCVBUF,
                     (char *)&sockbufsize, (int)sizeof(sockbufsize));
// check return code
sockbufsize = 0;
socklen_t size = sizeof(sockbufsize);  // in/out length argument for getsockopt()
ret = getsockopt(my_socket_fd, SOL_SOCKET, SO_RCVBUF,
                 (char *)&sockbufsize, &size);
// print sockbufsize
After the call to getsockopt(), the printed value is always 2x what it is set to (67108864 in the example above), but I believe that is to be expected.
I know that failure to consume data quickly enough can result in packet loss. However, all this application does is check the sequencing, then push the data into a queue; the actual processing is done in another thread. Furthermore, the machine is modern (dual Xeon X5560, 8 GB RAM) and very lightly loaded. We have literally dozens of identical applications receiving data at a much higher rate that do not experience this problem.
Besides a too-slow consuming application, are there other reasons why the Linux kernel might drop UDP packets?
FWIW, this is on CentOS 4, with kernel 2.6.9-89.0.25.ELlargesmp.
I had a similar problem with my program. Its task is to receive UDP packets in one thread and, using a blocking queue, write them to the database with another thread.
I noticed (using vmstat 1) that when the system was experiencing heavy I/O wait (reads), my application did not receive packets, although the system itself was still receiving them.
The problem was this: when heavy I/O wait occurred, the thread that was writing to the database was I/O starved while holding the queue mutex. As a result the UDP receive buffer overflowed with incoming packets, because the main thread that was receiving them was blocked in pthread_mutex_lock().
I resolved it by playing with the ioniceness (ionice command) of my process and the database process. Changing the I/O scheduling class to Best Effort helped. Surprisingly, I am not able to reproduce this problem now, even with the default I/O niceness.
My kernel is 2.6.32-71.el6.x86_64.
I’m still developing this app so I’ll try to update my post once I know more.
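Separately from I/O priorities, a common way to keep a receive thread from stalling on the queue mutex is to make sure the lock is never held during slow I/O. The following is a minimal sketch of that idea, not the fix described above (which was ionice). It assumes C++11 std::mutex instead of raw pthreads, and the class name PacketQueue and its methods are hypothetical.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical queue shared by a UDP receiver thread and a database writer.
// The mutex is only ever held for a push or an O(1) swap, never while the
// writer performs (possibly slow) I/O.
class PacketQueue {
public:
    // Receiver thread: the lock is held only for the push itself.
    void push(std::string packet) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(std::move(packet));
        }
        ready_.notify_one();
    }

    // Writer thread: block until data is available, then take the whole
    // batch by swapping containers. The lock is released when this function
    // returns, before the caller starts writing the batch anywhere.
    std::deque<std::string> take_all() {
        std::deque<std::string> batch;
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !pending_.empty(); });
        batch.swap(pending_);
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::string> pending_;
};
```

The writer thread would loop on take_all() and write each packet of the returned batch to the database with no lock held, so an I/O stall in the writer delays only the writer. The receiver can keep draining the socket, at the cost of the in-memory queue growing while the stall lasts.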