Under linux on the TCP Rx side, there are 3 queues, the PreQueue, ReceiveQueue, and BacklogQueue. Every time a valid TCP segment arrives its placed in one of these. According to goggle, linux is unique in having a PreQeueue as most TCP stacks only have Receive and Backlog where the theory behind PreQueue is. You do the processing in the sockets callee context/core/cpu instead of in the softirq / tasklet which could be on a different core or even cpu.
The above flow is shamelessly stolen from "The Performance Analysis of Linux Networking – Packet Receiving" by Wenji Wu, Matt Crawford of Fermilab. It was written in 2006 and im pretty sure its slightly wrong. As once an packet is put on the PreQueue no further processing is done in the softirq, where tcp_v4_do_rcv() + friends are processed in the callee context(right hand side). However its a nice high level overview.
As it happens you can disable the PreQueue entirely by setting /proc/sys/net/ipv4/tcp_low_latency to 1 and ... unfortunately need to mess with the kernel sauce a little to fully disable it. What does the latency look like?
TCP 128B A -> B -> A latency tcp_low_latency = 0
TCP 128B A->B->A latency tcp_low_latency = 1
As you can see... its basically the same... if your generous a tiny bit faster - not what we hopped for. The interesting question of course is, why is there no difference? Breaking the plots into tcp processing in the syscall callee context (tcp_low_latency=0) vs the softirq/tasklet context (tcp_low_latency=1) we can see most of the time is spent switching contexts, or more specifically waiting for the correct context to be scheduled.
TCP total (tcp_low_latency = 0)
TCP total (tcp_low_latency=1)
The plots above are a little counter intuitive. What its measuring is the time from tcp softirq start, to the end of tcp processing. So with tcp_low_latency=0, this includes the switch time from softirq -> callee context, and tcp_low_latency=1 everything is processed in the softirq. Thus low latency enabled gives a lower number and all is good ... but errr... it isnt. If we then look at the time from the end of tcp processing(in the kernel) to after recv() in user space we get the following.
TCP kernel end - > userspace tcp_low_latency=0
TCP kernel end -> userspace tcp_low_latency=1
... and funny enough its the mirror image. the low latency setting time is huge because iit includes the softirq->callee context switch, and with it disabled, its all ready in the callee context thus significantly less - just kernel->userspace switch. Thus explaining why the total round trip latency numbers are about the same (orange charts).
At first glance it appears most of our tcp Rx latency is the linux kernel scheduler? e.g. how long it takes to switch in the callee context - the one that called recv(). Which kind of sucks and unexpected, and raises the question of why UDP Rx is different... but we have plenty of tools test this hypothesis.