Wednesday, 20 January 2010

Life and times of Reliable Tx

For our ongoing investigation into the bowels of the networking stack, we've already looked at the latency of the UDP stack, so the next logical step is TCP. A lot of people turn their noses up at TCP for low-latency connections, saying it buffers too much, the latency is too high, and you're better off using UDP - which is great for a certain class of communication, say lossy network game physics. However, in finance dropping a few updates is death, and not an option.

There are two general approaches:

1) build a shitty version of TCP on top of UDP. This is the classic "not invented here" syndrome many developers fall into.
2) use TCP and optimize it for the situation.

In graphics, OpenGL/Direct3D have a "fast path" for the operation/state/driver combinations that are typically the application bottleneck, which the driver/stack engineers aggressively optimize for. If you change the state such that it's no longer on the fast path, everything goes through the slower generic code path and still produces correct results, but significantly slower. The approach gives the best of both worlds: a nice, feature-rich API that is still lightning fast for specific use cases.

If we take this philosophy and apply it to the network stack, there's no reason you can't get UDP-level performance or better for a specific use case - say short 128B low-latency sends - and fall back to the more generic/slower code path when a packet is occasionally dropped. The result is a damn fast, low-latency protocol that's reliable, in-order and, most importantly, the de-facto standard. And with that... let's put on the rubber gloves and delve into the TCP stack.
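To make that concrete, the usual starting point for the short-send use case is nothing exotic: a connected stream socket with Nagle disabled, so a 128B send() isn't held back behind unACKed data. A minimal sketch of that setup (hypothetical helper, not necessarily the exact configuration used for the tests below):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Connect a TCP socket tuned for small, low-latency sends.
 * TCP_NODELAY disables Nagle so each small send() can leave as its
 * own segment instead of waiting on outstanding ACKs. */
static int connect_low_latency(const char *ip, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return -1; }

    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return -1;
    }
    return fd;
}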

First up, let's take a high-level view and compare the round-trip latency of a 128B message over UDP vs TCP. Keep in mind this is all on an unloaded system, and the UDP numbers aren't exactly 128B messages but close, so it's more a guide than an absolute comparison. The trick here is assuming 0% packet loss and an already established TCP connection; then each send() will generate its own TCP segment and thus we can poke data into the payload. Hacky... yes, but easy and does the job for now.
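For the curious, a bare-bones version of that round-trip test looks something like the following: A sends 128B, B echoes it straight back, and A times the loop. This is purely an illustrative sketch (hypothetical names; clock_gettime here stands in for whatever timestamping the real test rig uses):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MSG_SIZE   128
#define ITERATIONS 10000

/* Ping-pong a 128B message and print each round trip in ns.
 * Assumes 'fd' is an established TCP connection (TCP_NODELAY set)
 * and the far end echoes every message straight back. */
static void round_trip_test(int fd)
{
    char buf[MSG_SIZE];
    memset(buf, 0xAB, sizeof(buf));

    for (int i = 0; i < ITERATIONS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        send(fd, buf, sizeof(buf), 0);

        ssize_t got = 0;
        while (got < MSG_SIZE) {            /* TCP is a byte stream: */
            ssize_t n = recv(fd, buf + got, MSG_SIZE - got, 0);
            if (n <= 0) return;             /* keep reading until all 128B arrive */
            got += n;
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        printf("%d %ld\n", i, ns);
    }
}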


round trip UDP A->B->A



round trip TCP A->B->A

Keep in mind the TCP time scale is 2x that of the UDP plot; they clock in at roughly 35,000ns for UDP vs 50,000ns for TCP, with TCP significantly slower - proving conventional wisdom. Where does the time go? The first step is to look at the time from application -> NIC on both the Tx and Rx sides for Machine A.
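For context, this sort of measurement needs a timestamp at each end of the path: something cheap on the application side right before the send call, and something inside the driver at the point the Tx descriptor is written (the driver half isn't shown here). A rough sketch of what the application half might look like (hypothetical helper, x86 TSC assumed, thread pinned to one core):

#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read the CPU timestamp counter - a cheap, per-core cycle count. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Timestamped send: the delta between *app_tsc and the driver's
 * Tx-descriptor timestamp gives the application -> NIC time. */
static ssize_t send_stamped(int fd, const void *buf, size_t len,
                            uint64_t *app_tsc)
{
    *app_tsc = rdtsc();
    return send(fd, buf, len, 0);
}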


UDP sendto() -> Tx descriptor

 
TCP send() -> Tx descriptor

The above plots are the Tx side of the equation, and it's pretty good - not a huge difference considering the UDP vs TCP delta in the round trip. So it must be the Rx logic where TCP has problems?


UDP Rx Intr -> recvfrom()

 
TCP Rx Intr -> recv()

... and we see the Rx is about 2x slower for TCP than UDP, around 2,500ns vs 1,200ns. Not sure what's going on there - it's obviously related to ACKing each TCP segment received, but 2x slower? We can do better for this use case.
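One knob worth poking on the Rx side is Linux's delayed-ACK behaviour. TCP_QUICKACK asks the stack to ACK immediately rather than deferring, and the kernel clears the flag internally, so it has to be re-armed around each receive. Whether it actually helps this particular Intr -> recv() path is something to measure rather than assume; a hedged sketch:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* recv() with TCP_QUICKACK re-armed each time. The kernel resets the
 * flag after use, so set it again before every receive if you want
 * ACKs sent without waiting on the delayed-ACK timer. */
static ssize_t recv_quickack(int fd, void *buf, size_t len)
{
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    return recv(fd, buf, len, 0);
}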

Comparing the round-trip latencies, we are missing about 15,000ns. Machine A accounts for, say, a generous 3,000ns of that, so where did the other 12,000ns go? On to Machine B. Remember Machine A's NIC is wired directly to the SouthBridge, whereas Machine B's has to go via PCI Express, hence the latency differences between the two machines.


UDP Machine B Rx Intr -> recvfrom()


Machine B TCP Rx Intr -> recv()

The Rx side is kind of interesting: a peak almost exactly on 5,000ns is a bit suspicious, yet it's slightly faster than UDP - which is... a little strange. Then there's a large chunk, over half the transfers, at around 8,000ns, so call it another 3,000ns or so just for Machine B's Rx.


Machine B UDP sendto() -> Tx descriptor


Machine B TCP send() -> Tx descriptor

As with Machine A, the Tx side is fairly consistent with UDP, even to the point of peaks at roughly the same pitch, if slightly translated. It's interesting that TCP is somehow slightly faster to hit the NIC - likely down to differences in data size.

So we have accounted for a bit over half of the time delta between TCP and UDP, but where did the rest of the time go? Hardware? Seems unlikely. More likely the UDP vs TCP test data is different enough, or maybe after many kernel and driver rebuilds the settings have drifted slightly.

In any case it's surprising how close the performance is for small sends. The next task is to look into the TCP Rx side and see why it's not competitive with UDP.
