Saturday 23 January 2010

round trip -10us

As we found in the previous post, our hypothesis is that most of the latency is in the switch from softirq/tasklet to the callee context, aka a scheduler problem. If this is correct, a polling recv() instead of a blocking one should give nice speedups, with of course higher CPU usage, meaning your HVAC and power bill goes up.


TCP 128B A->B->A round trip latency. blocking recv() x2


TCP 128B A->B->A round trip latency. polling recv() x2

... and wow, what a difference with just a few lines of code! It also confirms we need to hack on the Linux scheduler. The final speedup is around 10,000ns+, so 5,000ns on each side (A recv, B recv), with a very nice, small stddev - woot.
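Those few lines are, roughly, just spinning on a non-blocking recv() instead of sleeping in the kernel - a minimal sketch, not the exact test harness:

```c
/* Minimal sketch of the polling receive: spin on a non-blocking recv()
 * instead of blocking, so we never wait for the scheduler to wake us up.
 * Burns a core, but that's the point. */
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

static ssize_t poll_recv(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;                 /* got data (0 = peer closed) */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                /* real error */
        /* nothing there yet - try again immediately */
    }
}
```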

The conventional wisdom is "polling is bad", translating to "bad programmer": you're meant to do something fancy/smart because the latency is small. If small means 100us, that's a reasonable assumption, but 100us isn't small in HFT. So for low latency environments, where we are counting nanoseconds and there are more cycles/core than you can shake a stick at, you really should be using non-blocking, polling socket loops. Maybe ditch traditional interrupt-based device drivers too :)

... or hack on the kernel scheduler lol

Friday 22 January 2010

the myth of /proc/sys/net/ipv4/tcp_low_latency

Under Linux on the TCP Rx side, there are 3 queues: the PreQueue, ReceiveQueue, and BacklogQueue. Every time a valid TCP segment arrives it's placed in one of these. According to Google, Linux is unique in having a PreQueue, as most TCP stacks only have Receive and Backlog. The theory behind the PreQueue is that you do the processing in the socket's callee context/core/cpu instead of in the softirq/tasklet, which could be on a different core or even CPU.



The above flow is shamelessly stolen from "The Performance Analysis of Linux Networking – Packet Receiving" by Wenji Wu and Matt Crawford of Fermilab. It was written in 2006 and I'm pretty sure it's slightly wrong: once a packet is put on the PreQueue no further processing is done in the softirq, and tcp_v4_do_rcv() + friends are processed in the callee context (right hand side). However it's a nice high level overview.

As it happens you can disable the PreQueue entirely by setting /proc/sys/net/ipv4/tcp_low_latency to 1 (flipping it is trivial, see the sketch below), though ... unfortunately you need to mess with the kernel sauce a little to fully disable it. What does the latency look like?
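For reference, flipping it at runtime is just a write to that proc file - a minimal sketch (needs root, equivalent to "sysctl -w net.ipv4.tcp_low_latency=1"):

```c
/* Minimal sketch: toggle the TCP PreQueue at runtime by writing the
 * sysctl. Needs root; path as shipped in 2.6.30. */
#include <stdio.h>

static int set_tcp_low_latency(int on)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_low_latency", "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", on ? 1 : 0);
    return fclose(f);
}
```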


TCP 128B A -> B -> A latency tcp_low_latency = 0


TCP 128B A->B->A latency tcp_low_latency = 1

As you can see... it's basically the same... if you're generous, a tiny bit faster - not what we hoped for. The interesting question of course is: why is there no difference? Breaking the plots into TCP processing in the syscall callee context (tcp_low_latency=0) vs the softirq/tasklet context (tcp_low_latency=1), we can see most of the time is spent switching contexts, or more specifically waiting for the correct context to be scheduled.


TCP total (tcp_low_latency = 0)


TCP total (tcp_low_latency=1)

The plots above are a little counter-intuitive. What they measure is the time from TCP softirq start to the end of TCP processing. So with tcp_low_latency=0 this includes the switch time from softirq -> callee context, while with tcp_low_latency=1 everything is processed in the softirq. Thus low latency enabled gives a lower number and all is good ... but errr... it isn't. If we then look at the time from the end of TCP processing (in the kernel) to after recv() in user space, we get the following.


TCP kernel end -> userspace tcp_low_latency=0


TCP kernel end -> userspace  tcp_low_latency=1

... and funny enough it's the mirror image. The low latency setting's time is huge because it includes the softirq -> callee context switch, while with it disabled we're already in the callee context and thus it's significantly less - just the kernel -> userspace switch. This explains why the total round trip latency numbers are about the same (orange charts).

At first glance it appears most of our TCP Rx latency is the Linux kernel scheduler, i.e. how long it takes to switch in the callee context - the one that called recv(). Which kind of sucks and is unexpected, and raises the question of why UDP Rx is different... but we have plenty of tools to test this hypothesis.

Thursday 21 January 2010

TCP Rx processing

The previous post looked at things at a more macro level, so let's dig a bit deeper into the stack to find out what's going on. We break the plots up into driver / IP / TCP / user and get the following:


TCP 128B round trip total


NIC Driver time


IP processing time


TCP processing time


Kernel -> User switch

Which is the expected result: TCP processing time becomes the bottleneck. But what is it actually doing? Digging down a bit further we get:


TCP top level processing + prequeue


TCP tcp_rcv_established()

Which is rather surprising: it appears the top level processing in tcp_v4_rcv() is where the bulk of the time goes! Not what you expect when tcp_rcv_established() is the main workhorse. However... it gets stranger.


TCP before prequeue -> tcp_rcv_established()

It turns out most of the time goes somewhere between pushing the packet onto the TCP prequeue and actually processing it in tcp_rcv_established(). Not sure what's going on there, but surprisingly that's where all the action is.
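For what it's worth, the instrumentation behind these breakdowns is conceptually just TSC stamps dropped at interesting points in the stack and diffed later - something along these lines (a rough sketch of the idea, not the actual patch; it ignores per-socket state and concurrency):

```c
/* Rough sketch, NOT the actual patch: stash a TSC stamp where
 * tcp_prequeue() accepts the skb (softirq context), read it again at
 * the top of tcp_rcv_established() (callee context) and histogram the
 * delta. get_cycles() is the kernel's TSC read. */
#include <asm/timex.h>      /* get_cycles() */
#include <linux/types.h>

static cycles_t prequeue_stamp;

/* drop this in next to the tcp_prequeue() call in tcp_v4_rcv() */
static inline void mark_prequeue(void)
{
    prequeue_stamp = get_cycles();
}

/* and this at the top of tcp_rcv_established() */
static inline cycles_t cycles_since_prequeue(void)
{
    return get_cycles() - prequeue_stamp;
}
```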

Wednesday 20 January 2010

the gap

Too much software, too many switches, too many dials... too much variability... how do you make a Linux system stable at this timing level? The previous UDP charts were from last week's tests, so what happens if we run the exact same 128B ping-pong with UDP, using the same kernel, same driver, not even a reboot, and ...


UDP 128B latency A -> B -> A


TCP 128B latency A -> B -> A

... the numbers pretty closely match our 7,000ns delta, which is roughly the difference seen in the Rx/Tx handlers, so we are in the right ballpark and it looks good - kind of.

Life and times of Reliable Tx

For our on-going investigation into the bowels of the networking stack, we looked at the latency of the UDP stack, so the next logical step is TCP. A lot of people turn their nose up at TCP for low latency connections, saying it buffers too much and the latency is too high, and that you're better off using UDP - which is great for a certain class of communications, say lossy network game physics. However in finance dropping a few updates is death, and not an option.

There are 2 general approaches:

1) build a shitty version of TCP on top of UDP. This is the classic "not invented here" syndrome many developers fall into.
2) use TCP and optimize it for the situation.

In graphics (OpenGL / Direct3D) there's a "fast path" for the operations/state/driver combination that's typically the application bottleneck, which the driver/stack engineers aggressively optimize for. If you change the state such that it's no longer on the fast path, it goes through the slower generic code path and still produces correct results, but is significantly slower. The point of this approach is to have the best of both worlds: a nice feature-rich API with lightning fast performance for specific use cases.

If we take this philosophy and apply it to the network stack, there's no reason you can't get UDP-level or better performance for a specific use case, say short 128B low latency sends, and fall back to the more generic/slower code path when it occasionally drops a packet. The result is a damn fast, low latency protocol that's reliable, in-order and, most importantly, the de-facto standard. And with that... let's put on the rubber gloves and delve into the TCP stack.

First up, let's take a high level view and compare the round trip latency of a 128B message over UDP vs TCP. Keep in mind this is all on an unloaded system, and the UDP numbers aren't exactly 128B messages but close, so it's more of a guide than an absolute comparison. The trick here is that, assuming 0% packet loss and an already established TCP connection, each send() will generate its own TCP segment and thus we can poke data into the payload. Hacky... yes, but easy, and it does the job for now.
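The A side of the test is roughly the following - a sketch rather than the exact harness. TCP_NODELAY is my assumption here; the test itself only relies on 0% loss and an established connection so every 128B send() goes out as its own segment.

```c
/* Sketch of the A side of the 128B TCP ping-pong (not the exact
 * harness): send 128B, block until B echoes it back, return the round
 * trip in TSC cycles. Connection setup/teardown omitted. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#define MSG_SIZE 128

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* our assumption: disable Nagle so each small send() goes straight out */
static void setup(int fd)
{
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

static uint64_t pingpong_once(int fd)
{
    char buf[MSG_SIZE];
    memset(buf, 0xab, sizeof(buf));

    uint64_t t0 = rdtsc();
    send(fd, buf, sizeof(buf), 0);

    size_t got = 0;                   /* B echoes the same 128B back */
    while (got < sizeof(buf)) {
        ssize_t n = recv(fd, buf + got, sizeof(buf) - got, 0);
        if (n <= 0)
            return 0;                 /* error / connection closed */
        got += (size_t)n;
    }
    return rdtsc() - t0;              /* round trip in cycles */
}
```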


round trip UDP A->B->A



round trip TCP A->B->A

Keep in mind the TCP time scale is 2x that of the UDP plot; it clocks in at roughly 35,000ns vs 50,000ns, with TCP significantly slower - proving the conventional wisdom. Where does the time go? The first step is to look at the time from application -> NIC on both the Tx and Rx sides for Machine A.


UDP sendto() -> Tx descriptor

 
TCP send() -> Tx descriptor

The above plots are the Tx side of the equation, which is pretty good - not a huge difference considering the UDP vs TCP delta in the round trip. So it must be the Rx logic where TCP has problems?


UDP Rx Intr -> recvfrom()

 
TCP Rx Intr -> recv()

... and we see the Rx is about 2x slower for TCP than UDP, around 2,500ns vs 1,200ns. Not sure what's going on there, obviously related to ACKing each TCP segment received, but 2x slower? We can do better for this use case.

Comparing the round trip latency, we are missing about 15,000ns. Machine A accounts for, say, a generous 3,000ns, so where did the other 12,000ns go? On to Machine B. Remember Machine A's NIC is wired directly to the SouthBridge whereas Machine B's has to go via PCI Express, hence the latency differences between the machines.


UDP Machine B Rx Intr -> recvfrom()


Machine B TCP Rx Intr -> recv()

On the Rx side it's kind of interesting: having a peak almost exactly on 5,000ns is a bit suspicious, yet it's slightly faster than UDP - which is... a little strange. Then there's a large chunk, over half the transfers, around 8,000ns, so another 3,000ns or so just for Machine B Rx.


Machine B UDP sendto() -> Tx descriptor


Machine B TCP send() -> Tx descriptor

As with Machine A, the Tx side is fairly consistent with UDP, even to the point of the peaks having roughly the same pitch, if slightly translated. It's interesting that TCP is somehow slightly faster to hit the NIC - likely differences in data size.

So we have accounted for a bit over half of the time delta between TCP and UDP, but where did the rest of the time go? Hardware? Seems unlikely. More likely the UDP vs TCP test data is different enough, or maybe after many kernel and driver rebuilds the settings are slightly different?

In any case it's surprising how close the performance is for small sends. The next task is to look into the TCP Rx side and see why it's not competitive with UDP.

Saturday 16 January 2010

kernel scheduler

The double peak in the Rx -> recvfrom() path, specifically the kernel -> userland switch, looked suspiciously like some sort of core/hardware interaction. So, what happens if we change the number of cores? It's really simple to do - just add maxcpus=0 to the kernel boot command - and the following plots are generated.

 2 Core sendto() -> Tx Desc
 1 Core sendto() -> Tx Desc
Which is kind of interesting: not sure how/why the 1 Core sendto() has quite a few sample points < 1,000ns where the 2 Core version has none, but other than that nothing too exciting.

 2 Core Rx Intr -> recvfrom()
 1 Core Rx Intr -> recvfrom()

OTOH receive shows quite a substantial change: as we suspected, it goes from a double peak to a single peak, assumed to be kernel -> userland signaling behaviour. And ...

2 Core udp finish kernel space -> userspace recvfrom()
 1 Core udp finish kernel space -> userspace recvfrom()

... the plots speak for themselves. Strangely, adding cores in some cases increases latency (the 2nd peak). No idea what's going on, but keep in mind this is a blocking recvfrom() call, so it's obviously related to how the Linux scheduler deals with signals.

Tuesday 12 January 2010

Life and times of an Rx packet

To complete the picture we need to look at the Rx packet flow, from when Linux receives the Rx interrupt to when the user gets the packet from recvfrom(). First up is the high level view: total time from interrupt acknowledge -> recvfrom().


intr ack -> recvfrom()
And it looks fairly similar to our other plots. This has NAPI disabled and separate Rx/Tx handlers; it's interesting that with NAPI the latency goes both lower (18,000ns) and higher (35,000ns), presumably luck of the draw when polling @ 300Hz (softirq).

What happens after Linux receives the IRQ? The network driver's IRQ handler is invoked, where it acknowledges and clears the interrupt. Nothing particularly interesting, yet it does take quite a bit of time doing... something? L1/L2 miss reloads? Or just the latency of reading device registers? Not sure. Plot is below, around 600ns or so.



irq vector -> driver Rx clear

After the interrupt has been cleared, the driver reads in the Rx descriptors and uses the CPU to copy the packet (from the device DMA ring buffer) into the usual socket RECV buffer. This (RECV) buffer is the one people usually talk about when discussing sockets.

 
device buffer -> socket buffer copy

As you can see (above) the histogram is a bit weird: a clear chunk followed by a long tail of stuff. Guessing this is partially DDR fetch latency, at least the 1000ns part, as it's only copying 128B + UDP + IP + Ethernet headers - not much. Also, the destination memory address might not be aligned correctly to enable write combining, so it could be doing an RMW plus a miss on the source fetch - hope not, but it's an old x86 processor. The flat 2500ns might be unmapping the packet's DMA area, where some sort of kernel/PCI functions are at work. On a side note, not sure why it unmaps it and then re-maps it when the Rx descriptor is free and ready again - surely it's not for security?

After the payload has been copied, it does IP processing / sanity checking, which has a very small profile so no plot is included - it caches all the IP/device/socket info from the previous packet.
 
  
netfilter PRE_ROUTING

Once IP processing is done it's off to netfilter. PRE_ROUTING (above) is quite minimal and does nothing - no rules are defined.

 
netfilter LOCAL_IN

And the same for LOCAL_IN (above): it does basically nothing too - no rules are defined.


udp processing

Finally, with UDP processing (above) things get interesting. Firstly the spread on the plot is quite large, so something's going on there. The really interesting part is the height - probably hard to see, but there is a near 100% column in time bin 0. I.e. most of the time UDP processing is extremely cheap, then occasionally it does *something*. Not sure what, but definitely a case for investigation.


UDP (kernel) -> recvfrom(user)

Finally the packet arrives in userspace (above), which is the source of our 2 peaks in total latency. The reason for this? Not sure - likely related to the blocking/signaling behaviour of the blocking recvfrom() call. A possible theory for the size and pitch is the softirq timer frequency (300Hz default), where a quick test is to increase/decrease this frequency, rebuild the kernel, run the test and check the result.


Rx latency summary

It's difficult to summarize each module with a single figure as the stddev for each component can vary significantly; however, the above is a rough guide for a non-NAPI configured driver and stack on 2.6.30.10. It's really hard to reproduce the exact numbers: just booting the stock Arch Linux distro kernel (which is the source of the .config file for this 2.6.30.10 kernel) shows a wildly different profile, and reloading the network driver too many times causes weirdness. Such is life when there's a metric ton of code running.

A line count of the C files in linux-2.6.30.10/net clocks in at around 912K LOC, and net/ipv4 at 128K LOC... Obviously the Ethernet + IP/UDP + Intel e1000e path is a fraction of that, but even 20K LOC is a significant chunk of logic, making tuning at the microsecond level truly a fine art.

Sunday 10 January 2010

The life and times of a Tx packet

Finally... got some time to spend on this. We got a rough high level view last time of where all the time went, so let's dig a bit deeper into the SW stack to find out what is going on. So... let's get started using a stock 2.6.30.10 kernel: build it, install it, run it, and boom - the first plot.

Machine A sendto() -> Tx desc

Which is our top-level latency reference: around 1500ns or so from the user-level function call to the NIC driver incrementing the Tx descriptor ring. Not bad, and surprisingly quite a bit faster than our previous tests (2000-3000ns). Why this is, I've no idea, but likely a slightly different kernel version and build parameters. The other strange thing is the "shadow graph", possibly due to increasing the resolution of our histogram bins (100ns -> 10ns). All timing is based on an old 2.6GHz Xeon.
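As an aside, the binning itself is nothing clever: convert a TSC delta to nanoseconds at the nominal clock rate and drop it into fixed-width bins. A rough sketch of the idea (the constants are this machine's; the real instrumentation lives in the kernel):

```c
/* Sketch of the histogram binning: TSC delta -> ns at the nominal
 * 2.6GHz clock, then into fixed-width bins (100ns before, 10ns for
 * this plot). Not the real instrumentation, just the idea. */
#include <stdint.h>

#define CPU_GHZ   2.6        /* old Xeon, nominal clock */
#define BIN_NS    10         /* histogram bin width in ns */
#define BIN_COUNT 1024

static uint64_t hist[BIN_COUNT];

static void hist_add(uint64_t cycles)
{
    uint64_t ns  = (uint64_t)((double)cycles / CPU_GHZ); /* cycles -> ns */
    uint64_t bin = ns / BIN_NS;

    if (bin >= BIN_COUNT)
        bin = BIN_COUNT - 1;   /* clamp the long tail into the last bin */
    hist[bin]++;
}
```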

Hacking the networking core is a royal pain in the ass: there's no easy module to build, which means rebuilding the kernel and rebooting each time... a paaaaaainfully slow dev cycle. But let's start by looking at the glibc code for sendto(), which does basically nothing - it just invokes a syscall - so the first plot is kernel call overhead.


userland -> kernel overhead

Looks like around 250ns on average, so around 700 cycles. The double peaks are most likely due to the 2 hardware threads on the machine. One side note that's not evident in the plot: the kernel overhead drops from about 1200 cycles at the start to an average of 700 cycles within roughly 1000 calls.

The packet then arrives at udp_sendmsg() in the IPv4 UDP code, where it does some misc packet header/buffer allocation and a few checks, finds the cached route and acquires a lock on the socket. General housekeeping stuff.

kernel socket/packet house keeping

Housekeeping clocks in at around the same as the kernel switch, 600-700 cycles or about 250ns. After the packet has been checked, it's copied into the socket's send buffer - this is what people generally think of when discussing socket buffers. It memcpys the packet from userland into kernel space and enables/maps PCI/DMA access from the NIC.



Userspace -> Kernelspace Packet copy

The histogram is a bit prickly for some reason, possibly due to the PCI DMA map commands, as the amount of data we're copying is tiny - 128B - and it should be in L1 and definitely in L2 cache, so not sure what's going on there. It's possible the combination of old hardware and unaligned writes means the CPU does a read-modify-write on the destination cache line instead of a direct write (no read), and thus we pay the latency cost of an uncached memory fetch. Or... it's just the kernel DMA/PCI mapping code, not sure.



IP/UDP header write

After the payload is copied, the stack adds the appropriate IP/UDP headers (above). Nothing too interesting here, but it is surprising how long it takes: ~150ns, which... is a lot. Packet checksums are all offloaded onto the hardware, so it's doing something else here.

Now it gets interesting. Almost all stock kernel builds have netfilter enabled, to allow packet filtering / routing / firewalls / VPNs etc - very core use cases for Linux. There's a ton of books and documents on how to use netfilter/ipchains, but in our case it's entirely pass-through; in fact we should disable netfilter to reduce latency.


netfilter LOCAL passthru



netfilter POST pass thru

As you can see (above) it's still quite fast, 80ns or so all up, but I think it's safe to assume the exchange isn't trying to h4x0r your machine, so it's all quite unnecessary.

After netfilter approves the packet, it's sent towards the NIC driver via another buffering system: qdisc (queuing disciplines). This is MAC level now, typically a single fast priority FIFO per MAC, but it's completely configurable using the "tc" traffic control command and probably other tools. Qdisc is a powerful system, enabling various buffering, scheduling and filtering to be applied, but they all add latency - not particularly helpful for low latency systems. In fact I intend to completely disable qdisc to reduce latency.



 qdisc packet enqueue

Queueing is fairly fast (above), around 130ns or so. The 2nd hump in the histogram is interesting - guessing it's wait time on an atomic lock. Now that our packet is on the queue for eth0, all that's left is for the net scheduler to issue it to the NIC driver. However, there's a nice optimization: after the packet is queued, it immediately attempts to send the packet to the driver, and in this case it usually succeeds. The only reason it can fail to send immediately is if another hardware thread is running the net scheduler and pushing data to the driver, i.e. we have a small FIFO here to avoid dropping packets, but it does add another source of latency.


qdisc queue -> driver xmit

As expected (above), the latency from the qdisc queue to issuing a driver call is small, 100ns or so. What's interesting is the double spike - presumably it misses the initial scheduling pass and hits on the 2nd try.


 NIC driver 1 Tx packet process time

And finally (above), our trusty e1000e NIC driver's processing cost: it fills out the Tx descriptor, moves the ring buffer forward, then frees the packet, and is fairly quick at around 400ns. Note this is the time from the driver entry point to the exit point, which is longer than driver entry -> Tx update (below) / hardware hand-off, due to the cleanup code.


qdisc enqueue -> NIC Tx descriptor write

The question is: if the NIC driver is only taking 300-400ns to kick a Tx descriptor, then the rest of the time must be spent in the Linux kernel's networking stack?

 
 sendto() -> the start of NIC driver handoff


Answer -> yes, most of the time is spent in the kernel. The plot above shows the entire SW latency excluding the NIC driver, and the shape matches the first high level plot (the green one) except shifted slightly to the left. This is good, as we have quite a few options to reduce the kernel's processing time and make that packet hit the MAC in < 1,000ns!

typical high level Tx hw/sw flow @ 2.6Ghz old Xeon machine

In summary, the above flow chart shows our current latency estimates. We can only guesstimate the hardware latency due to lack of tools, but you can clearly see the HW latency is far greater than the software's, as we are using a typical (old) consumer/server hardware layout that's designed for high throughput, NOT ultra low latency. Which is why anyone serious about ultra low latency... has a very different hardware topology :)