This is just a general question relating to some high-performance computing I’ve been wondering about. A certain low-latency messaging vendor says in its supporting documentation that it uses raw sockets to transfer data directly from the network device to the user application, and that this reduces messaging latency even further than its other (admittedly carefully thought-out) design decisions already do.
My question is therefore to those that grok the networking stacks on Unix or Unix-like systems. How much difference are they likely to be able to realise using this method? Feel free to answer in terms of memory copies, numbers of whales rescued or areas the size of Wales 😉
Their messaging is UDP-based, as I understand it, so there’s no problem with establishing TCP connections etc. Any other points of interest on this topic would be gratefully thought about!
Best wishes,
Mike
There are some pictures at http://vger.kernel.org/~davem/tcp_output.html (found by googling for tcp_transmit_skb(), which is a key part of the TCP datapath). There are some more interesting things on his site: http://vger.kernel.org/~davem/

In the user-to-TCP transmit part of the datapath there is 1 copy from user space into the skb with skb_copy_to_page (when sending via tcp_sendmsg()) and 0 copies with do_tcp_sendpages (called by tcp_sendpage()). The copy is needed to keep a backup of the data in case a segment is not delivered. skb buffers in the kernel can be cloned, but their data stays in the first (original) skb. Sendpage can take a page from another part of the kernel and keep it as the backup (I think there is something like COW).

Call paths (traced manually in lxr):

Sending: tcp_push_one / __tcp_push_pending_frames
Receive: tcp_recv_skb()

On receive there can be 1 copy from kernel to user: skb_copy_datagram_iovec (called from tcp_recvmsg). tcp_read_sock() can also involve a copy. It calls the sk_read_actor callback function. If that corresponds to a file or memory, it may (or may not) copy the data from the DMA zone. If it is another network, it has the skb of the received packet and can reuse its data in place.

For UDP: receive = 1 copy (skb_copy_datagram_iovec, called from udp_recvmsg); transmit = 1 copy (udp_sendmsg -> ip_append_data -> getfrag, which seems to be ip_generic_getfrag with 1 copy from user space, but may be something sendpage/splice-like without page copying).
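To make the UDP case concrete from the user-space side, here is a minimal sketch of the two system calls behind which those copies happen. It sends one datagram over loopback and reads it back. The helper name udp_loopback_roundtrip is made up for illustration, and because loopback is used no real NIC or DMA is involved, so this only shows the user/kernel boundary, not the driver path.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send one UDP datagram over loopback and read it back.
 * Returns the number of bytes received, or -1 on error. */
ssize_t udp_loopback_roundtrip(const void *msg, size_t len,
                               void *out, size_t out_cap)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct timeval tv = { 2, 0 };   /* receive timeout so we never hang */
    ssize_t got = -1;
    int rx, tx;

    assert(msg != NULL && out != NULL);

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto done;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto done;
    if (getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
        goto done;
    if (setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) < 0)
        goto done;

    /* Transmit side: the payload is copied from our buffer into kernel
     * memory somewhere below this call (the udp_sendmsg path above). */
    if (sendto(tx, msg, len, 0, (struct sockaddr *)&addr, sizeof addr)
            != (ssize_t)len)
        goto done;

    /* Receive side: the payload is copied from the skb into 'out'
     * (the udp_recvmsg path above). */
    got = recvfrom(rx, out, out_cap, 0, NULL, NULL);

done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return got;
}
```

Calling it with a small message and a large enough output buffer should hand back the same bytes; each direction costs one user/kernel copy, however the NIC side is handled.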
Generally speaking, there must be at least 1 copy when sending from or receiving to user space, and 0 copies when using zero-copy (surprise!) with kernel-space source/target buffers for the data. All headers are added without moving the packet; a DMA-enabled network card (all modern ones) will take data from anywhere in the DMA-enabled address space. For ancient cards PIO is needed, so there will be one more copy, from kernel space to the PCI/ISA/something-else I/O registers or memory.
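As an illustration of the "kernel-space source" case, here is a rough Linux-only sketch using sendfile(2), where the data comes from the page cache rather than from a user buffer. The helper name sendfile_loopback and the temp-file path are invented for the example. I believe that on kernels of the era discussed here sendfile() on a TCP socket ends up in the tcp_sendpage() path, but treat that as an assumption; newer kernels may route it differently. The receiving side still does an ordinary read() with its usual copy.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put 'data' in a temp file, push it into a loopback
 * TCP connection with sendfile(), and read it back on the accepted side.
 * Returns the number of bytes read, or -1 on error. Keep 'len' small so
 * it fits in the socket buffers (this is single-threaded). */
ssize_t sendfile_loopback(const char *data, size_t len,
                          char *out, size_t out_cap)
{
    char path[] = "/tmp/zc_demo_XXXXXX";
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    int ffd = -1, lfd = -1, cfd = -1, afd = -1;
    off_t off = 0;
    size_t total = 0;
    ssize_t got = -1;

    assert(data != NULL && out != NULL);

    ffd = mkstemp(path);
    if (ffd < 0)
        goto done;
    unlink(path);                   /* file vanishes once closed */
    if (write(ffd, data, len) != (ssize_t)len)
        goto done;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0)
        goto done;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        goto done;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto done;
    afd = accept(lfd, NULL, NULL);
    if (afd < 0)
        goto done;

    /* No user-space buffer on the send side: the kernel takes the file's
     * pages and hands them to the socket. */
    if (sendfile(cfd, ffd, &off, len) != (ssize_t)len)
        goto done;
    shutdown(cfd, SHUT_WR);         /* signal EOF to the reader */

    while (total < out_cap) {
        ssize_t n = read(afd, out + total, out_cap - total);
        if (n < 0)
            goto done;
        if (n == 0)
            break;
        total += (size_t)n;
    }
    got = (ssize_t)total;

done:
    if (afd >= 0) close(afd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    if (ffd >= 0) close(ffd);
    return got;
}
```

The point of the design is only that the sender never touches the payload in user space after the file is written; whether the kernel really avoids every copy depends on the kernel version and the driver.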
UPD: On the path from the NIC to the TCP stack (this is NIC-dependent; I checked 8139too) there is one more copy, from rx_ring to the skb, and the same for transmit: from the skb to the tx buffer, +1 copy. You have to fill in the IP and TCP headers, but does the skb already contain them, or room for them?