Sunday 14 February 2010

read free

If we profile the Rx side, it's basically the cost of a register read + some cold DDR fetches + a bit of fluffing around, thus the next logical step is to ditch all register reads. It's quite easy as the MAC will DMA the descriptor status and the packet contents directly into DDR, thus... let's just poll the Rx Descriptor Status instead of reading the Rx head register.
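
Since that's the crux of the change, here's a minimal sketch of the poll loop - assuming an e1000-style legacy Rx descriptor where the MAC writes back a status byte with a DD (descriptor done) bit once the packet and descriptor have hit DDR. The names, sizes and ring handling below are illustrative, not lifted from the actual driver.

```c
#include <stddef.h>
#include <stdint.h>

/* e1000-style legacy Rx descriptor (16 bytes); the MAC DMAs the packet
   into the Rx buffer then writes this back with the DD bit set          */
struct rx_desc {
    volatile uint64_t addr;     /* physical address of the Rx buffer     */
    volatile uint16_t length;   /* bytes DMA'd into the buffer           */
    volatile uint16_t csum;
    volatile uint8_t  status;   /* bit 0 = DD (descriptor done)          */
    volatile uint8_t  errors;
    volatile uint16_t special;
};

#define RXD_STAT_DD  0x01
#define RX_BUF_SIZE  2048       /* illustrative buffer stride            */

extern uint8_t *rx_buf_base;    /* Rx buffer region mapped into user space */

/* spin on the descriptor status sitting in (cacheable) DDR - zero MMIO reads */
static int rx_poll(struct rx_desc *ring, uint32_t *next, uint32_t ring_size,
                   uint8_t **pkt, uint16_t *len)
{
    struct rx_desc *d = &ring[*next];

    if (!(d->status & RXD_STAT_DD))
        return 0;                       /* nothing new yet               */

    /* assumes the status write-back lands *after* the payload DMA -
       see the ordering caveat below                                     */
    *pkt = rx_buf_base + (size_t)*next * RX_BUF_SIZE;
    *len = d->length;

    d->status = 0;                      /* recycle; the Rx tail register
                                           still gets bumped, but that is
                                           a cheap posted write, not a read */
    *next = (*next + 1) % ring_size;
    return 1;
}
```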

 
128B TCP round trip no register reads


128B TCP round trip register reads

 
128B TCP round trip metal -> linux -> metal (with reg reads)

As you can see we're now a respectable 4-5,000ns faster than the linux stack! However care must be taken with this register-free approach as I'm not certain what the memory ordering rules are here, e.g. if PCIe/SB writes are committed in-order or can be scheduled out of order. It's highly unlikely that by the time you've read the Rx Desc Status the payload data has not reached DDR, however if you've got billions of dollars running through it you *really* want to be certain.

Finally, the bulk of the speedup is from Machine B's long-ass register read times, thus the full NIC to NIC time on machine B for a 128B TCP packet is...

 
Machine B, NIC(CPU Rx) -> NIC(CPU Tx) latency

... around 410ns, and that's with SW checksums. To put that in perspective, my shitty FIX Fast (ArcaBook compact) SW decoder runs at around 140ns/packet assuming it's hot in the cache. So we can clearly speed up the TCP loopback code, but what's the point when HW latency is the bottleneck?

Obviously the wire-to-wire time is a few microseconds... e.g. it's going to take 1,000+ ns just for the MAC to write into DDR, and conversely for a CPU write to hit the MAC... but it underscores how at this level the problem is all hardware/topology, as HW latency is an order of magnitude larger than SW... no fun... so time to get back to my day job.

Saturday 13 February 2010

tcp metal

A quick update on what TCP metal vs linux looks like. Our TCP stack is err... not finished, but enough is done to get some ballpark numbers.

128B TCP all metal


128B TCP metal->linux->metal

As you can see we're only just beating linux here, to the tune of 1-2,000ns - and we're not even a complete TCP stack yet! Certainly not what I expected, but it goes to show linux is pretty damn good... or my first pass code is really sucky. So we need to slash around 5,000 cycles off our implementation... things are finally getting interesting.

linux udp overhead

An interesting experiment is to see the real cost of the linux UDP stack against our metal UDP stack, e.g. measuring the round trip of 128B UDP metal(A) -> metal(B) -> metal(A) vs metal(A) -> linux(B) -> metal(A).

 
128B UDP roundtrip all metal


128B UDP roundtrip metal->linux->metal

As you can see above, the linux overhead is around 4-5,000ns. Keep in mind this is using polling UDP linux loops, so your garden variety blocking socket will be significantly more. One small note: the metal stack only offloads the ethernet checksum, with the IP/UDP checksums all done in software.
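
For reference, the software IP/UDP checksum is just the standard RFC 1071 one's-complement sum - a minimal sketch (not the exact metal-stack code) looks like this:

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 internet checksum: 16-bit one's complement of the one's
   complement sum. Used as-is for the IPv4 header, and with the
   pseudo-header sum passed in as 'seed' for UDP/TCP.                    */
static uint16_t inet_checksum(const void *buf, size_t len, uint32_t seed)
{
    const uint16_t *p = buf;
    uint32_t sum = seed;

    while (len > 1) {                   /* sum 16-bit words              */
        sum += *p++;
        len -= 2;
    }
    if (len)                            /* odd trailing byte             */
        sum += *(const uint8_t *)p;

    while (sum >> 16)                   /* fold the carries back in      */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```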

On a random note, what's really unusual is how UDP payload size quite dramatically changes the profiles.


64B UDP metal


128B UDP metal


192B UDP metal

 
256B UDP metal
Not sure what's going on here. Certainly hope it's not PCIe/SB congestion, e.g. the MAC's DMA (of the Rx buffer) into DDR competing with register read bandwidth? Surely not. One thing is quite clear though: the spacing of the spikes is most likely the latency of a single read on machine B... the question is, why are there so many peaks in the larger transfers? Bizarre.

Sunday 7 February 2010

metal to metal

Been busy on some other stuff the last week or so, and it's time to get back to this. Our last investigation looked into how the stock linux scheduler is responsible for a fair chunk of the network latency we've been seeing. Thus the logical question is, what happens if we ditch it all and write to the metal?

Intel hardware is great for this, the HW docs are excellent and the reference sauce / linux drivers are nice and easy to follow. What's the setup? Pretty simple: wrote a linux driver which maps BAR0 (register space) and BAR1 (flash memory) directly into user space, then allocates and maps into the user's address space 4 DMA buffers - Rx Descriptors, Rx Buffer, Tx Descriptors, Tx Buffer - which is all we need to write our own PHY/MAC/Ether/IPv4 + ARP/ICMP/UDP/TCP stack. It's a straightforward process, with the only hairy part being the TCP stack. Got everything up and running except TCP in a few days, as NIC hardware is pretty damn simple and connectionless protocols are nice and easy. Certainly not production ready, but enough to get some interesting results.
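
The user space side of that boils down to little more than a handful of mmap() calls against the driver's char device - roughly along these lines, with the device name, mmap offsets and sizes made up for illustration rather than taken from the actual driver:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>

/* hypothetical mmap offsets understood by our driver to select which
   resource gets mapped - purely illustrative                            */
#define MAP_BAR0_REGS  (0L << 20)
#define MAP_RX_DESC    (1L << 20)
#define MAP_RX_BUF     (2L << 20)

static inline uint32_t reg_rd32(volatile uint8_t *bar0, uint32_t off)
{
    return *(volatile uint32_t *)(bar0 + off);  /* uncached MMIO read    */
}

int main(void)
{
    int fd = open("/dev/nicmetal", O_RDWR);     /* hypothetical node     */

    /* error checks omitted for brevity */
    volatile uint8_t *bar0 = mmap(NULL, 128 * 1024,
                                  PROT_READ | PROT_WRITE, MAP_SHARED,
                                  fd, MAP_BAR0_REGS);
    void *rx_desc = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, MAP_RX_DESC);
    void *rx_buf  = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, MAP_RX_BUF);
    /* ... same again for the Tx descriptors / Tx buffer, then bring the
       MAC up by poking registers through bar0 ...                       */

    (void)reg_rd32(bar0, 0x0008);               /* e.g. a status register */
    (void)rx_desc; (void)rx_buf;
    return 0;
}
```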

Let's start with round trip latency. It's the same 2 machines, and we're using a 64 byte ICMP payload for this test - note the time scales are different.

 
64B ICMP round trip (metal)

128B TCP round trip (polling)
Our metal gets on average around 19,000ns for the total round trip, so we can average that and say 9,500ns on each side, but it isn't really 50/50 as we shall see. Still, while not an apples-to-apples comparison, compared to our best result so far (polling TCP) we are about 15,000ns or so faster than the linux stack - almost 2x, nice!

So where does that 19,000ns go? It's quite surprising and actually rather depressing how shite x86 IO architecture is. First, the results:



Machine A - latency of 1 32-bit register read

 
Machine B - latency of 1 32-bit register read

... and as you can see, Machine A takes about 800ns to read 1 register, while Machine B is 2,200ns for *one* read. Why are the 2 machines so different? As discussed before, Machine B has its NIC on the PCIe bus, while Machine A is wired directly to the south bridge. So while Machine B is a fancy pants latest-and-greatest Nehalem and Machine A an old dog of a 2007(ish) Xeon, the old dog wins, which really drives home how important physical hardware topology is at this performance level.
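
For the record, the measurement is nothing fancier than wrapping a single 32-bit MMIO read in rdtsc and converting cycles to ns using the (fixed) core clock - something like the sketch below; cpuid/rdtscp serialization would tighten it up if you want to be pedantic.

```c
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* time a single uncached 32-bit register read; bar0 is the mmap'd BAR0
   from the sketch above, ghz the (fixed) core clock                     */
static uint64_t time_reg_read_ns(volatile uint8_t *bar0, uint32_t off,
                                 double ghz)
{
    uint64_t t0 = rdtsc();
    volatile uint32_t v = *(volatile uint32_t *)(bar0 + off);
    uint64_t t1 = rdtsc();
    (void)v;
    return (uint64_t)((double)(t1 - t0) / ghz); /* cycles -> ns          */
}
```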

Your regular everyday driver will be interrupt(ish... NAPI polling) driven, starting with an Rx ISR that reads the interrupt status register to work out what's going on, then reads the Rx descriptor head register to see how many new packets there are and where. If we summarize using fancy pants Machine B:

MAC -> CPU isr latency : 2,200 / 2 = 1,100ns
CPU isr status register read: 2,200ns
CPU rx desc head register read: 2,200ns

Putting latency at around 5,500ns, that's *before* protocol processing, and we haven't even looked at the Rx descriptor or Rx buffer data yet!
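
To make that concrete, the first two MMIO reads of the conventional path look roughly like this - e1000-ish register offsets and cause bits assumed purely for illustration:

```c
#include <stdint.h>

#define REG_ICR   0x00C0u    /* interrupt cause read      (e1000-ish)    */
#define REG_RDH   0x2810u    /* Rx descriptor head        (e1000-ish)    */
#define ICR_RXT0  0x0080u    /* "packet received" cause bit              */

static inline uint32_t reg_rd32(volatile uint8_t *bar0, uint32_t off)
{
    return *(volatile uint32_t *)(bar0 + off);  /* uncached MMIO read    */
}

/* conventional interrupt-driven Rx entry: two MMIO reads before a single
   descriptor or byte of payload has been touched - ~2,200ns each on
   Machine B                                                             */
static void rx_isr(volatile uint8_t *bar0, uint32_t sw_tail)
{
    uint32_t cause = reg_rd32(bar0, REG_ICR);   /* what happened?        */
    if (!(cause & ICR_RXT0))
        return;

    uint32_t head = reg_rd32(bar0, REG_RDH);    /* how far did HW get?   */
    /* ... walk descriptors from sw_tail up to head, then protocol work  */
    (void)head; (void)sw_tail;
}
```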

 
Machine A Rx Descriptor load (DDR)


Machine B Rx Descriptor load (DDR)
Rx descriptors are in DDR and thankfully can be cached. However the problem is that when a device writes into DDR, I believe (and the numbers seem to agree) it invalidates the line(s) from all levels of the cache, which kind of sucks. I kind of expected Nehalem would just invalidate L1/L2 and update the L3, but it seems it's not that smart. Thus in the plots above we see the cost of an L1/L2/L3 miss, around 100ns/270 cycles on Machine A and 75ns/200 cycles on Machine B, with Machine B faster due to the memory controller being on the CPU instead of the north bridge - one less bus to travel.

In summary 

Wire -> PHY : ?ns
PHY -> MAC : ?ns
MAC -> CPU isr latency : 2,200 / 2 = 1,100ns
CPU isr status register read : 2,200ns
CPU rx desc head read : 2,200ns
CPU rx desc read : 75ns
CPU rx buffer read : 75ns

Total: 5,650ns or ~15,000 cycles already! Obviously we can pipeline some of this, but you won't find that and other optimizations in the standard drivers. Actually, on x86 I'm not certain you can issue multiple uncached reads that are not guarded/serialized, i.e. pipelined. Moral of the story: it's a challenge to get single-digit microsecond latency to turn a trade around (Rx -> Tx), measured from the wire, using standard PC server hardware.

Saturday 23 January 2010

round trip -10us

As we found in the previous post, our hypothesis is that most of the latency is in the switch from the softirq/tasklet to the callee context, aka a scheduler problem. So if this is correct, a polling recv() instead of a blocking one should give nice speedups, with of course higher CPU usage, meaning your HVAC and power bill go up.


TCP 128B A->B->A round trip latency. blocking recv() x2


TCP 128B A->B->A round trip latency. polling recv() x2

... and wow, what a difference with just a few lines of code! It confirms we need to hack on the linux scheduler. The final speedup is around 10,000ns+, so 5,000ns on each side (A recv, B recv), with a very nice, small stddev - woot.
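
For completeness, those few lines are basically just a busy-poll on a non-blocking recv() - a minimal sketch:

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* spin on the socket until a message arrives; burns a core, but skips
   the blocking wake-up path through the scheduler                       */
static ssize_t poll_recv(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;                   /* got something (or EOF)        */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* real error                    */
        /* else: nothing yet - keep spinning                             */
    }
}
```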

The conventional wisdom is "polling is bad", translating to "bad programmer" - you're meant to do something fancy/smart because the latency is small. If small means 100us that's a reasonable assumption, however 100us isn't small in HFT. Thus for low latency environments, where we are counting nanoseconds and there's more cycles/core than you can shake a stick at, you really should be using non-blocking, polling socket loops. Maybe ditch traditional interrupt based device drivers too :)

... or hack on the kernel scheduler lol

Friday 22 January 2010

the myth of /proc/sys/net/ipv4/tcp_low_latency

Under linux on the TCP Rx side there are 3 queues: the PreQueue, ReceiveQueue, and BacklogQueue. Every time a valid TCP segment arrives it's placed in one of these. According to google, linux is unique in having a PreQueue, as most TCP stacks only have Receive and Backlog. The theory behind the PreQueue is that you do the processing in the socket's callee context/core/cpu instead of in the softirq/tasklet, which could be on a different core or even a different cpu.



The above flow is shamelessly stolen from "The Performance Analysis of Linux Networking – Packet Receiving" by Wenji Wu and Matt Crawford of Fermilab. It was written in 2006 and I'm pretty sure it's slightly wrong: once a packet is put on the PreQueue no further processing is done in the softirq, and tcp_v4_do_rcv() + friends are processed in the callee context (right hand side). However it's a nice high level overview.

As it happens you can disable the PreQueue entirely by setting /proc/sys/net/ipv4/tcp_low_latency to 1, and... unfortunately you need to mess with the kernel sauce a little to fully disable it. What does the latency look like?


TCP 128B A -> B -> A latency tcp_low_latency = 0


TCP 128B A->B->A latency tcp_low_latency = 1

As you can see... it's basically the same... if you're generous, a tiny bit faster - not what we hoped for. The interesting question of course is, why is there no difference? Breaking the plots into TCP processing in the syscall callee context (tcp_low_latency=0) vs the softirq/tasklet context (tcp_low_latency=1), we can see most of the time is spent switching contexts, or more specifically waiting for the correct context to be scheduled.


TCP total (tcp_low_latency = 0)


TCP total (tcp_low_latency=1)

The plots above are a little counter-intuitive. What they're measuring is the time from TCP softirq start to the end of TCP processing. So with tcp_low_latency=0 this includes the switch time from softirq -> callee context, while with tcp_low_latency=1 everything is processed in the softirq. Thus low latency enabled gives a lower number and all is good... but errr... it isn't. If we then look at the time from the end of TCP processing (in the kernel) to after recv() in user space, we get the following.


TCP kernel end - > userspace tcp_low_latency=0


TCP kernel end -> userspace  tcp_low_latency=1

... and funnily enough it's the mirror image. The low latency setting's time is huge because it includes the softirq -> callee context switch, while with it disabled we're already in the callee context and it's significantly less - just the kernel -> userspace switch. Which explains why the total round trip latency numbers are about the same (orange charts).

At first glance it appears most of our TCP Rx latency is the linux kernel scheduler, e.g. how long it takes to switch in the callee context - the one that called recv(). Which kind of sucks and is unexpected, and raises the question of why UDP Rx is different... but we have plenty of tools to test this hypothesis.

Thursday 21 January 2010

TCP Rx processing

The previous post looked at things at a more macro level, so let's dig a bit deeper into the stack to find out what's going on. We break the plots up into driver / ip / tcp / user and get the following:


TCP 128B round trip total


NIC Driver time


IP processing time


TCP processing time


Kernel -> User switch

Which is the expected result, TCP processing time becomes the bottleneck, but what is it actually doing? Digging down a bit further we get:


TCP top level processing + prequeue


TCP tcp_rcv_established()

Which is rather surprising: it appears the top level processing in tcp_v4_rcv() is where the bulk of the time goes! Not what you expect when tcp_rcv_established() is the main workhorse. However... it gets stranger.


TCP before prequeue -> tcp_rcv_established()

Turns out most of the time goes somewhere between pushing the packet onto the TCP PreQueue and actually processing it in tcp_rcv_established(). Not sure what's going on there, but surprisingly that's where all the action is.

Wednesday 20 January 2010

the gap

Too much software, too many switches, too many dials... too much variability... how do you make a linux system stable at this timing level? The previous UDP charts were from last week's tests, so what happens if we run the exact same 128B UDP ping-pong, using the same kernel, same driver, not even a reboot, and...


UDP 128B latency A -> B -> A


TCP 128B latency A -> B -> A

... the numbers pretty closely match our 7,000ns delta, which is roughly the difference seen in the Rx/Tx handlers, so we are in the right ballpark and it looks good - kind of.

Life and times of Reliable Tx

For our on-going investigation into the bowels of the networking stack, we looked at the latency of the UDP stack, thus the next logical step is TCP. A lot of people turn their noses up at TCP for low latency connections, saying it buffers too much, the latency is too high, you're better off using UDP - which is great for a certain class of communications, say lossy network game physics. However in finance dropping a few updates is death, and not an option.

There's 2 general approaches:

1) build a shitty version of TCP on top of UDP. This is the classic "not invented here" syndrome many developers fall into.
2) use TCP and optimize it for the situation.

In graphics (OpenGL / Direct3D) there's a "fast path" for the operations/state/driver combination that's typically the application bottleneck, which the driver/stack engineers aggressively optimize for. If you change the state such that it's no longer on the fast path, it goes through the slower generic code path: still correct results, but significantly slower. The point of this approach is to have the best of both worlds - a nice feature-rich API with lightning fast performance for specific use cases.

If we take this philosophy and apply it to the network stack, there's no reason you can't get UDP-level or better performance for a specific use case, say short 128B low latency sends, and fall back to the more generic/slower code path when a packet is occasionally dropped. The result is a damn fast, low latency protocol that's reliable, in-order and, most importantly, the de-facto standard. And with that... let's put on the rubber gloves and delve into the TCP stack.

First up, let's take a high level view and compare the round trip latency of a 128B message over UDP vs TCP. Keep in mind this is all on an unloaded system, and the UDP numbers aren't exactly 128B messages but close, so it's more a guide than an absolute comparison. The trick here is, assuming 0% packet loss and an already established TCP connection, each send() will generate its own TCP segment and thus we can poke data into the payload. Hacky... yes, but easy and does the job for now.
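
One knob that helps that assumption hold is disabling Nagle, so each small send() goes out immediately rather than being coalesced behind unacked data - whether that's exactly what's done here isn't stated, but the sender side would look something like:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* on an already-connected socket: disable Nagle so each small send()
   goes straight out as its own segment (window permitting)              */
static int setup_low_latency_tx(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

static int send_128b(int fd, const char payload[128])
{
    /* 0% loss + established connection assumed, as in the test above    */
    return send(fd, payload, 128, 0) == 128 ? 0 : -1;
}
```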


round trip UDP A->B->A



round trip TCP A->B->A

Keep in mind the TCP time scale is 2x the UDP plot. It clocks in at around 35,000ns (UDP) vs 50,000ns (TCP), with TCP significantly slower - proving the conventional wisdom. Where does the time go? The first step is to look at the time from application -> NIC on both the Tx and Rx sides for Machine A.


UDP sendto() -> Tx descriptor

 
TCP send() -> Tx descriptor

The above plots are the Tx side of the equation, which looks pretty good - not a huge difference considering the UDP vs TCP delta in the round trip. So it must be the Rx logic where TCP has problems?


UDP Rx Intr -> recvfrom()

 
TCP Rx Intr -> recv()

... and we see the Rx is about 2x slower for TCP than UDP, around 2,500ns vs 1,200ns. Not sure what's going on there - obviously related to ACKing each TCP segment it's received, but 2x slower? We can do better for this use case.

Comparing the round trip latency, we are missing about 15,000ns. Machine A is, say, a generous 3,000ns, so where did the other 12,000ns go? On to Machine B. Remember Machine A's NIC is wired directly to the south bridge while Machine B has to go via PCIe, hence the latency differences between the machines.


UDP Machine B Rx Intr -> recvfrom()


Machine B TCP Rx Intr -> recv()

On the Rx side it's kind of interesting: having a peak almost exactly on 5,000ns is a bit suspicious, yet it's slightly faster than UDP - which is... a little strange. Then there's a large chunk, over half the transfers, around 8,000ns, so another 3,000ns or so just for Machine B Rx.


Machine B UDP sendto() -> Tx descriptor


Machine B TCP send() -> Tx Descriptor

As with Machine A, the Tx side is fairly consistent with UDP, even to the point of peaks of roughly the same pitch, if slightly translated. It's interesting that TCP is somehow slightly faster to hit the NIC - likely differences in data size.

So we have accounted for a bit over half of the time delta between TCP and UDP, but where did the rest go? Hardware? Seems unlikely. More likely the UDP vs TCP test data is different enough, or maybe after many kernel and driver rebuilds the settings are slightly different?

In any case it's surprising how close the performance is for small sends. The next task is to look into the TCP Rx side and see why it's not competitive with UDP.

Saturday 16 January 2010

kernel scheduler

The double peak in the Rx -> recvfrom(), specifically the kernel -> userland switch, looked suspiciously like some sort of core/hardware interaction. So, what happens if we change the # of cores? It's really simple to do: just add maxcpus=0 to the kernel boot command line. And thus the following plots are generated:

 2 Core sendto() -> Tx Desc
 1 Core sendto() -> Tx Desc
Which is kind of interesting - not sure how/why the 1 core sendto() has quite a few sample points < 1,000ns where the 2 core version has none, but other than that nothing too exciting.

 2 Core Rx Intr -> recvfrom()
 1 Core Rx Intr -> recvfrom()

OTOH receive shows quite a substantial change: as we suspected, it goes from a double peak to a single peak, assumed to be kernel -> userland signaling behaviour. And...

2 Core udp finish kernel space -> userspace recvfrom()
 1 Core udp finish kernel space -> userspace recvfrom()

... the plots speak for themselves. Strangely, adding cores in some cases increases latency (the 2nd peak). No idea what's going on, but keep in mind this is a blocking recvfrom() call, so it's obviously related to how the linux scheduler deals with signals.