Sunday 14 February 2010

read free

If we profile the Rx side, it's basically the cost of a register read + some cold DDR fetches + a bit of fluffing around, so the next logical step is to ditch all register reads. It's quite easy, as the MAC will DMA the descriptor status and the packet contents directly into DDR, thus... let's just poll the Rx descriptor status in memory instead of reading the Rx head register.
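A minimal sketch of what that looks like, assuming e1000-style legacy Rx descriptors where the MAC writes back a DD (descriptor done) bit in the status byte - the struct layout follows the Intel 8254x datasheet, the rest of the names are mine, not production code:

```c
#include <stdint.h>

/* Intel 8254x-style legacy Rx descriptor (16B), written back by the MAC */
struct rx_desc {
    uint64_t buffer_addr;   /* DMA address of the Rx buffer */
    uint16_t length;        /* bytes the MAC wrote into the buffer */
    uint16_t csum;
    uint8_t  status;        /* bit 0 = DD, descriptor done */
    uint8_t  errors;
    uint16_t special;
};
#define RXD_STAT_DD 0x01

/* Spin on the status byte in DDR: no MMIO read, no syscall.
 * volatile stops the compiler hoisting the load out of the loop. */
static inline uint16_t rx_wait(volatile struct rx_desc *d)
{
    while (!(d->status & RXD_STAT_DD))
        ;                                    /* burn the core */
    /* lfence orders *our* loads - cheap insurance; it cannot fix
     * device-side write ordering, see the caveat below */
    __asm__ __volatile__("lfence" ::: "memory");
    return d->length;                        /* payload is already in DDR */
}
```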

 
128B TCP round trip no register reads


128B TCP round trip register reads

 
128B TCP round trip metal -> linux -> metal (with reg reads)

As you can see we're now a respectable 4-5,000ns faster than the linux stack! However, care must be taken with this register-free approach, as I'm not certain what the memory ordering rules are here, e.g. whether PCIe/SB writes are committed in order or can be scheduled out of order. It's highly unlikely that by the time you've read the Rx descriptor status the payload data has not reached DDR - but if you've got billions of dollars running through it, you *really* want to be certain.

Finally, the bulk of the speedup comes from Machine B's long-ass register read times, so the full NIC-to-NIC time on Machine B for a 128B TCP packet is...

 
Machine B, NIC(CPU Rx) -> NIC(CPU Tx) latency

... around 410ns, and that's with SW checksums. To put that in perspective, my shitty FIX Fast (ArcaBook compact) SW decoder runs around 140ns/packet, assuming it's hot in the cache. So we can clearly speed up the TCP loopback code, but what's the point when hw latency is the bottleneck?

Obviously the wire-to-wire time is a few microseconds... e.g. it's going to take 1,000+ ns just for the MAC's write to reach DDR, and conversely for a CPU write to hit the MAC... but it underscores how, at this level, the problem is all hardware/topology, as hw latency is an order of magnitude larger than sw... no fun... so time to get back to my day job.

Saturday 13 February 2010

tcp metal

Quick update on what TCP on the metal vs linux looks like. Our TCP stack is, err... not finished, but enough is done to get some sort of ballpark numbers.

128B TCP all metal


128B TCP metal->linux->metal

As you can see we're only just beating out linux here, to the tune of 1-2,000ns - and we're not even a complete TCP stack yet! Certainly not what I expected, but it goes to show linux is pretty damn good... or my first-pass code is really sucky. So we need to slash around 5,000 cycles off our implementation... things are finally getting interesting.

linux udp overhead

An interesting experiment is to see the real cost of the linux UDP stack against our metal UDP stack, e.g. measuring the round trip of 128B UDP metal(A) -> metal(B) -> metal(A) vs metal(A) -> linux(B) -> metal(A).

 
128B UDP roundtrip all metal


128B UDP roundtrip metal->linux->metal

As you can see above, the linux overhead is around 4-5,000ns. Keep in mind this is using a polling UDP loop on linux, so for your garden-variety blocking socket it will be significantly more. One small note: the metal stack only offloads the ethernet checksum, with the IP/UDP checksums all done in software.
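For reference, the "polling UDP linux loop" is nothing exotic - just a nonblocking socket spun hard in user space. A minimal sketch (the port number is arbitrary, and a real harness would timestamp around the recvfrom/sendto pair):

```c
/* Busy-poll UDP echo: nonblocking socket spun in a tight user-space loop,
 * so no scheduler wakeup sits between packet arrival and processing. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port        = htons(5000);          /* arbitrary test port */
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("socket/bind");
        return 1;
    }

    char buf[2048];
    for (;;) {
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        /* MSG_DONTWAIT: return EAGAIN instead of sleeping in the kernel */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT,
                             (struct sockaddr *)&peer, &plen);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                continue;                        /* nothing yet: keep spinning */
            perror("recvfrom");
            return 1;
        }
        /* bounce it straight back for the round-trip measurement */
        sendto(fd, buf, (size_t)n, 0, (struct sockaddr *)&peer, plen);
    }
}
```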

On a random note, what's really unusual is how the UDP payload size quite dramatically changes the profiles.


64B UDP metal


128B UDP metal


192B UDP metal

 
256B UDP metal

Not sure what's going on here. Certainly hope it's not PCIe/SB congestion, e.g. the MAC's DMA (of the Rx buffer) into DDR competing with register read bandwidth? Surely not. One thing is quite clear though: the spacing of the spikes is most likely the latency of a single read on machine B... question is, wtf are there so many peaks in the larger transfers? Bizarre.

Sunday 7 February 2010

metal to metal

Been busy on some other stuff for the last week or so, and it's time to get back to this. Our last investigation looked into how the stock linux scheduler is responsible for a fair chunk of the network latency we've been seeing. Thus the logical question: what happens if we ditch it all and write to the metal?

Intel hardware is great for this - the hw docs are excellent and the reference source/linux drivers are nice and easy to follow. What's the setup? Pretty simple: wrote a linux driver which maps BAR0 (register space) and BAR1 (flash memory) directly into user space, then allocates and maps into the user's address space 4 DMA buffers - Rx Descriptors, Rx Buffer, Tx Descriptors, Tx Buffer - which is all we need to write our own PHY/MAC/Ether/IPv4 + ARP/ICMP/UDP/TCP stack. It's a straightforward process, the only hairy part being the TCP stack. Got everything up and running except TCP in a few days, as NIC hardware is pretty damn simple and connection-less protocols are nice and easy. Certainly not production ready, but enough to get some interesting results.
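As an aside, on reasonably recent kernels you don't strictly need a custom driver for the mapping half of this - sysfs will hand you the BARs directly. A sketch of the idea (the PCI address is hypothetical, pull yours from lspci; offsets are for 8254x-class parts):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical PCI address - substitute your NIC's, from lspci -D */
#define BAR0_PATH "/sys/bus/pci/devices/0000:01:00.0/resource0"
#define BAR0_SIZE (128 * 1024)  /* 128KB register window on 8254x parts */

int main(void)
{
    int fd = open(BAR0_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* map the register BAR straight into this process */
    volatile uint32_t *regs = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* sanity check: the device STATUS register, offset 0x8 on 8254x */
    printf("STATUS = 0x%08x\n", regs[0x8 / 4]);

    munmap((void *)regs, BAR0_SIZE);
    close(fd);
    return 0;
}
```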

Let's start with round-trip latency. It's the same 2 machines, using a 64-byte ICMP payload for this test - note the time scales are different.

 
64B ICMP round trip (metal)

128B TCP round trip (polling)

Our metal stack averages around 19,000ns for the total round trip, so naively 9,500ns per side - though it isn't really a 50/50 split, as we shall see. Still, while not an apples-to-apples comparison, against our best result so far (polling TCP) we are about 15,000ns faster than the linux stack - almost 2x, nice!

So where does that 19,000ns go? It's quite surprising and actually rather depressing how shite x86 IO architecture is. First, the results:



Machine A - latency of 1 32b register read

 
Machine B - latency of 1 32b register read

... and as you can see, Machine A takes about 800ns to read 1 register, while Machine B takes 2,200ns for *one* read. Why are the 2 machines so different? As discussed before, Machine B has its NIC on the PCIe bus, while Machine A's is wired directly to the south bridge. Thus while Machine B is a fancy-pants latest-and-greatest Nehalem and Machine A an old dog, 2007(ish) Xeon, the old dog wins - which really drives home how important physical hardware topology is at this performance level.
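For what it's worth, the measurement itself is just a single MMIO load bracketed by serialized timestamp reads - a sketch, assuming the regs mapping from the previous snippet (cpuid/rdtsc is the crude classic; rdtscp would be tidier on newer parts):

```c
#include <stdint.h>

/* Serialized timestamp: cpuid drains the pipeline, rdtsc reads the TSC */
static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("cpuid\n\trdtsc"
                         : "=a"(lo), "=d"(hi)
                         : "a"(0)
                         : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Time one 32b MMIO load; regs is the BAR0 mapping, STATUS sits at 0x8 */
uint64_t time_reg_read(volatile uint32_t *regs)
{
    uint64_t t0 = rdtsc_serialized();
    (void)regs[0x8 / 4];               /* the uncached read being measured */
    uint64_t t1 = rdtsc_serialized();
    return t1 - t0;                    /* cycles; divide by GHz for ns */
}
```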

Your regular everyday drivers will be interrupt driven (ish... NAPI polling), starting with an Rx ISR that reads the interrupt status register to work out wtf is going on, then reads the Rx descriptor head register to see how many new packets there are and where. Summarizing with fancy-pants Machine B (sketched in code after the breakdown):

MAC -> CPU ISR latency : 2,200 / 2 = 1,100ns
CPU ISR status register read : 2,200ns
CPU Rx desc head register read : 2,200ns
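In code, those two reads look roughly like this - 8254x-family register offsets (ICR at 0x00C0, RDH at 0x2810), with the descriptor walk elided:

```c
#include <stdint.h>

#define E1000_ICR 0x00C0    /* interrupt cause read - clears on read */
#define E1000_RDH 0x2810    /* Rx descriptor head */

/* The two uncached register reads every interrupt-driven Rx path eats,
 * each a full bus round trip (~2,200ns on Machine B). */
void rx_isr(volatile uint32_t *regs)
{
    uint32_t cause = regs[E1000_ICR / 4];   /* read #1: what fired? */
    if (!cause)
        return;                             /* shared line, not ours */
    uint32_t head = regs[E1000_RDH / 4];    /* read #2: how far has
                                               the MAC written? */
    /* ... walk descriptors up to 'head' and hand packets up ... */
    (void)head;
}
```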

That puts latency at around 5,500ns, and that's *before* protocol processing - we haven't even looked at the Rx descriptor or the Rx buffer data yet!

 
Machine A Rx Descriptor load (DDR)


Machine B Rx Descriptor load (DDR)

Rx descriptors are in DDR and thankfully can be cached. However, the problem is that when a device writes into DDR it - I believe, and the numbers seem to agree - invalidates the line(s) from all levels of the cache, which kind of sucks. I expected Nehalem would just invalidate the L1/L2 and update the L3, but it seems it's not that smart. Thus in the plots above we see the cost of an L1/L2/L3 miss: around 100ns/270 cycles on Machine A, and 75ns/200 cycles on Machine B. Machine B is faster because the memory controller is on the CPU instead of the north bridge - one less bus to travel.
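This miss cost is easy to reproduce without the NIC in the picture: evict the line yourself, then time the reload. A sketch reusing the serialized rdtsc helper from above, with clflush standing in for the device's invalidation:

```c
#include <stdint.h>
#include <emmintrin.h>          /* _mm_clflush, _mm_mfence (SSE2) */

uint64_t rdtsc_serialized(void);    /* defined in the earlier sketch */

/* Emulate a device-invalidated descriptor line: flush it from every
 * cache level, then time the reload all the way from DDR. */
uint64_t time_cold_load(volatile uint32_t *line)
{
    _mm_clflush((const void *)line);    /* evict from L1/L2/L3 */
    _mm_mfence();                       /* wait for the flush to complete */
    uint64_t t0 = rdtsc_serialized();
    (void)*line;                        /* the miss being measured */
    uint64_t t1 = rdtsc_serialized();
    return t1 - t0;                     /* cycles; divide by GHz for ns */
}
```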

In summary:

Wire -> PHY : ?ns
PHY -> MAC : ?ns
MAC -> CPU ISR latency : 2,200 / 2 = 1,100ns
CPU ISR status register read : 2,200ns
CPU Rx desc head read : 2,200ns
CPU Rx desc read : 75ns
CPU Rx buffer read : 75ns

Total: 5,650ns, or 15,000 cycles, already! Obviously we can pipeline some of this, but you won't find that and other optimizations in the standard drivers. Actually, on x86 I'm not certain you can issue multiple uncached reads that are not guarded/serialized, i.e. pipelined. Moral of the story: it's a challenge to get single-digit-microsecond latency to turn a trade around (Rx -> Tx), measured from the wire, using standard PC server hardware.