hacking NASDAQ @ 500 FPS: the myth of /proc/sys/net/ipv4/tcp_low_latency

Under Linux, the TCP Rx side has 3 queues: the PreQueue, the ReceiveQueue, and the BacklogQueue. Every time a valid TCP segment arrives, it's placed in one of these. According to Google, Linux is unique in having a PreQueue, as most TCP stacks only have Receive and Backlog. The theory behind the PreQueue is that you do the TCP processing in the socket's callee context (the process that called recv(), on its core/CPU) instead of in the softirq / tasklet, which could be on a different core or even a different CPU.

The above flow is shamelessly stolen from "The Performance Analysis of Linux Networking – Packet Receiving" by Wenji Wu and Matt Crawford of Fermilab. It was written in 2006 and I'm pretty sure it's slightly wrong: once a packet is put on the PreQueue, no further processing is done in the softirq, and tcp_v4_do_rcv() + friends are processed in the callee context (right hand side). However, it's a nice high level overview.

As it happens, you can (mostly) disable the PreQueue by setting /proc/sys/net/ipv4/tcp_low_latency to 1, although unfortunately you need to mess with the kernel sauce a little to fully disable it. What does the latency look like?
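For anyone who wants to check or flip the setting from a script, here is a minimal Python sketch. The path is the one from the post; the helper names read_sysctl and write_sysctl are made up for illustration, and the sketch assumes the file may be missing on some kernels, in which case it just returns None.

```python
from pathlib import Path

# Path taken from the post; the helper names below are hypothetical.
TCP_LOW_LATENCY = Path("/proc/sys/net/ipv4/tcp_low_latency")


def read_sysctl(path=TCP_LOW_LATENCY):
    """Return the sysctl value as an int, or None if the file is absent or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def write_sysctl(value, path=TCP_LOW_LATENCY):
    """Write an integer value to the sysctl file (normally needs root)."""
    path.write_text(f"{int(value)}\n")


if __name__ == "__main__":
    current = read_sysctl()
    if current is None:
        print(f"{TCP_LOW_LATENCY} not available on this system")
    else:
        print(f"tcp_low_latency = {current}")
```

Calling write_sysctl(1) as root would correspond to the tcp_low_latency = 1 runs; it does not do the extra kernel changes mentioned above.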

TCP 128B A -> B -> A latency tcp_low_latency = 0

TCP 128B A->B->A latency tcp_low_latency = 1
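To get a rough feel for an A -> B -> A number like the ones in these two plots, a user-space ping-pong is probably the simplest starting point. The sketch below is not how the plots were produced; it is a hedged approximation that bounces a 128B message off an echo thread over loopback and times each round trip. The function names, the use of TCP_NODELAY, and the loopback setup are all assumptions made for the example.

```python
import socket
import threading
import time

MSG_SIZE = 128  # payload size used in the post's plots


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket (recv may return less)."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo_server(listener):
    """Host B: echo fixed-size messages back until the peer disconnects."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        try:
            while True:
                conn.sendall(recv_exact(conn, MSG_SIZE))
        except ConnectionError:
            pass


def measure_rtt(host, port, rounds=1000):
    """Host A: return a list of round-trip times in microseconds."""
    samples = []
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        payload = b"x" * MSG_SIZE
        for _ in range(rounds):
            start = time.perf_counter_ns()
            sock.sendall(payload)
            recv_exact(sock, MSG_SIZE)
            samples.append((time.perf_counter_ns() - start) / 1000.0)
    return samples


if __name__ == "__main__":
    # Loopback only, so both ends share one machine; real A and B would differ.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    threading.Thread(target=echo_server, args=(listener,), daemon=True).start()
    rtts = sorted(measure_rtt("127.0.0.1", port))
    print(f"min {rtts[0]:.1f}us  median {rtts[len(rtts) // 2]:.1f}us  max {rtts[-1]:.1f}us")
```

Numbers from this will include Python interpreter overhead, so treat them as relative (setting 0 vs setting 1 on the same box) rather than absolute.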

As you can see... it's basically the same... if you're generous, a tiny bit faster - not what we hoped for. The interesting question, of course, is why is there no difference? Breaking the plots into TCP processing in the syscall callee context (tcp_low_latency=0) vs the softirq/tasklet context (tcp_low_latency=1), we can see most of the time is spent switching contexts, or more specifically, waiting for the correct context to be scheduled.

TCP total (tcp_low_latency = 0)

TCP total (tcp_low_latency=1)

The plots above are a little counterintuitive. What they measure is the time from TCP softirq start to the end of TCP processing. So with tcp_low_latency=0 this includes the switch time from softirq -> callee context, while with tcp_low_latency=1 everything is processed in the softirq. Thus low latency enabled gives a lower number and all is good ... but errr... it isn't. If we then look at the time from the end of TCP processing (in the kernel) to after recv() in user space, we get the following.

TCP kernel end -> userspace tcp_low_latency=0

TCP kernel end -> userspace tcp_low_latency=1

... and funnily enough it's the mirror image. With the low latency setting enabled the time is huge because it includes the softirq -> callee context switch, and with it disabled we're already in the callee context, so it's significantly less - just the kernel -> userspace switch. This explains why the total round trip latency numbers are about the same (orange charts).
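The mirror image is easier to see with some toy arithmetic. In the sketch below all three durations are invented for illustration (they are not measurements from the plots), and rx_segments is a hypothetical helper. The only thing it models is where the wait for the recv() caller lands: before TCP processing with the PreQueue, after it with tcp_low_latency=1.

```python
# All numbers are invented for illustration; they are not measurements.
CONTEXT_WAIT_US = 20.0   # waiting for the recv() caller to be scheduled
TCP_PROCESSING_US = 3.0  # tcp_v4_do_rcv() and friends
KERNEL_TO_USER_US = 1.0  # return from the kernel into user space


def rx_segments(tcp_low_latency):
    """Return (softirq start -> tcp end, tcp end -> after recv()) in microseconds."""
    if tcp_low_latency:
        # TCP runs in the softirq; the wait for the caller comes afterwards.
        return TCP_PROCESSING_US, CONTEXT_WAIT_US + KERNEL_TO_USER_US
    # PreQueue path: wait for the caller first, then run TCP in its context.
    return CONTEXT_WAIT_US + TCP_PROCESSING_US, KERNEL_TO_USER_US


for setting in (0, 1):
    tcp_total, to_user = rx_segments(setting)
    print(f"tcp_low_latency={setting}: tcp total {tcp_total}us, "
          f"kernel end -> userspace {to_user}us, sum {tcp_total + to_user}us")
```

Whatever values you plug in, the wait only moves from one segment to the other, so the sum stays the same for both settings.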

At first glance it appears most of our TCP Rx latency is the Linux kernel scheduler, e.g. how long it takes to switch in the callee context - the one that called recv(). Which kind of sucks and is unexpected, and raises the question of why UDP Rx is different... but we have plenty of tools to test this hypothesis.

4 comments:

Anonymous, 22 April 2010 at 06:03
Good post! Thanks! Waiting for UDP myths testing and more detail of tcp processing.
Anonymous, 22 November 2010 at 11:21
Nice :)

Thank you. I was actually investigating network optimization with a linux based router and some open source optimization solutions like Traffic squeezer http://trafficsqueezer.org and the traditional linux LARTC hacks, and got struck with this weird looking option, and so thank you also for those charts and benchmarks.

-Rodrick
Anonymous, 16 May 2012 at 09:46
Hi,
Interesting blog! Can you tell me how you metered your Linux box in order to get those graphs? ...I see you are doing some intra-machine latency testing, so you must be metering the box itself with some sort of tool...? I'm quite curious.

We're looking at modifying some of our sysctl settings, so that's why I ask.

Thanks.
-GreyGnome
brunogm, 17 May 2015 at 20:09
hi, tried the BFS cpu scheduler?


Friday, 22 January 2010

the myth of /proc/sys/net/ipv4/tcp_low_latency
