Tuesday 12 January 2010

Life and times of an Rx packet

To complete the picture we need to look at the Rx packet flow, from when Linux receives the Rx interrupt to when the user gets the packet from recvfrom(). First up is the high-level view: total time from interrupt acknowledge -> recvfrom().
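Before the plots, a note on method: the timestamps behind these charts come from the CPU's TSC via rdtsc, with probe code scattered through the stack (see the comments at the end). A minimal sketch of the kind of helper involved - the name and style here are mine, not the actual instrumentation:

    #include <stdint.h>

    /* Read the x86 timestamp counter. On this era of hardware the TSC
     * ticks at the core clock, so cycles / GHz gives nanoseconds.
     * Real measurements also want a serializing instruction and the
     * thread pinned to one core, both elided here. */
    static inline uint64_t rdtsc(void)
    {
            uint32_t lo, hi;
            __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
            return ((uint64_t)hi << 32) | lo;
    }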


intr ack -> recvfrom()
intr ack -> recvfrom()
And it looks fairly similar to our other plots. This has NAPI disabled and separate Rx/Tx handlers. Interestingly, with NAPI enabled the latency goes both lower (18,000ns) and higher (35,000ns), presumably luck of the draw when polling @ 300Hz (softirq).
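For context, NAPI swaps the per-packet interrupt for softirq-driven polling; the driver side has roughly the shape of the sketch below (the mynic_* names are made up, not the real e1000e code):

    #include <linux/netdevice.h>

    struct mynic_adapter {
            struct napi_struct napi;
            /* ring pointers, register mappings, ... */
    };

    /* Called from the NET_RX softirq until the Rx ring is drained.
     * While polling is active the Rx interrupt stays masked, which is
     * where the different latency spread with NAPI comes from. */
    static int mynic_poll(struct napi_struct *napi, int budget)
    {
            struct mynic_adapter *ad =
                    container_of(napi, struct mynic_adapter, napi);
            int done = mynic_clean_rx_ring(ad, budget); /* hypothetical */

            if (done < budget) {
                    napi_complete(napi);     /* ring empty: stop polling */
                    mynic_enable_rx_irq(ad); /* re-arm interrupt (hypothetical) */
            }
            return done;
    }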

What happens after Linux receives the IRQ? The network driver's IRQ handler is invoked, where it acknowledges and clears the interrupt. Nothing particularly interesting, yet it takes quite a bit of time doing... something? L1/L2 miss reloads? Or just the latency of reading device registers? Not sure. Plot is below; around 600ns or so.
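In rough code terms the handler does little more than this (a sketch assuming an e1000e-style read-to-clear interrupt cause register; MYNIC_ICR is a made-up offset):

    /* Non-NAPI Rx interrupt handler sketch. The uncached MMIO read of
     * the cause register is a PCI round trip, which by itself could be
     * a good chunk of the ~600ns seen above. */
    static irqreturn_t mynic_intr(int irq, void *data)
    {
            struct mynic_adapter *ad = data;
            u32 icr = readl(ad->hw_addr + MYNIC_ICR); /* read also clears */

            if (!icr)
                    return IRQ_NONE;     /* shared line, not our interrupt */

            mynic_clean_rx_ring(ad);     /* hypothetical: drain descriptors */
            return IRQ_HANDLED;
    }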



irq vector -> driver Rx clear

After the interrupt has been cleared, the driver reads in the Rx descriptors and uses the CPU to copy the packet (from the device's DMA ring buffer) into the usual socket receive buffer. This (RECV) buffer is the one people usually talk about when discussing sockets.

 
device buffer -> socket buffer copy

As you can see (above) the histogram is a bit weird: a clear chunk followed by a long tail of stuff. Guessing this is partially DDR fetch latency, at least the 1000ns part, as it's only copying 128B + UDP + IP + Ethernet headers - not much. Also the destination memory address might not be aligned correctly to enable write combining, so it's doing a read-modify-write plus a miss on the source fetch; hope not, but it's an old x86 processor. The flat 2500ns might be unmapping the packet's DMA area, where some sort of kernel / PCI functions are at work. On a side note, not sure why it unmaps the buffer and then re-maps it when the Rx descriptor is free and ready - surely it's not for security?
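For reference, the per-packet work in this phase looks roughly like the sketch below (simplified, made-up names; not the actual e1000e source):

    /* Unmap the DMA buffer, copy the frame into a fresh skb, hand it to
     * the stack, then re-map the buffer so the NIC can reuse the
     * descriptor. */
    static void mynic_receive_one(struct mynic_adapter *ad,
                                  struct mynic_rx_buffer *buf,
                                  unsigned int len)
    {
            struct sk_buff *skb;

            dma_unmap_single(&ad->pdev->dev, buf->dma,
                             MYNIC_RX_BUF_LEN, DMA_FROM_DEVICE);

            skb = netdev_alloc_skb(ad->netdev, len + NET_IP_ALIGN);
            if (skb) {
                    skb_reserve(skb, NET_IP_ALIGN); /* align the IP header */
                    memcpy(skb_put(skb, len), buf->data, len);
                    skb->protocol = eth_type_trans(skb, ad->netdev);
                    netif_rx(skb);                  /* non-NAPI hand-off */
            }

            /* re-map so the descriptor can be handed back to the NIC */
            buf->dma = dma_map_single(&ad->pdev->dev, buf->data,
                                      MYNIC_RX_BUF_LEN, DMA_FROM_DEVICE);
    }

As for the unmap/re-map question: it isn't security. The DMA API requires the pair for correctness on machines with an IOMMU or non-coherent caches; on a plain x86 box of this era it's close to a no-op, but the call and bookkeeping overhead remain - plausibly that flat 2500ns.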

After the payload has been copied, it does IP processing / sanity checking, which has a very small profile so no plot is included - it caches all IP/device/socket info from the previous packet.
 
  
netfilter PRE_ROUTING

Once IP processing is done it's off to netfilter. PRE_ROUTING is quite minimal (above) and does nothing - no rules are defined.

 
netfilter LOCAL_IN

And the same for LOCAL_IN (above): it does basically nothing too - no rules are defined.
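Even with no rules loaded, every packet still crosses the hook macro at both points; abridged from 2.6.30's net/ipv4/ip_input.c, the call sites look like:

    /* With empty chains NF_HOOK falls through almost immediately to the
     * okfn continuation, hence the tiny profiles above. */
    int ip_rcv(struct sk_buff *skb, struct net_device *dev,
               struct packet_type *pt, struct net_device *orig_dev)
    {
            /* ... header sanity checks elided ... */
            return NF_HOOK(PF_INET, NF_INET_PRE_ROUTING, skb, dev, NULL,
                           ip_rcv_finish);
    }

    int ip_local_deliver(struct sk_buff *skb)
    {
            /* ... fragment reassembly elided ... */
            return NF_HOOK(PF_INET, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                           ip_local_deliver_finish);
    }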


udp processing

Finally, with UDP processing (above) things get interesting. Firstly, the spread on the plot is quite large, so something's going on there. The really interesting part is the height - probably hard to see, but there is a nearly-100% column in time bin 0. That is, most of the time UDP processing is extremely cheap, then occasionally it does *something*. Not sure what, but definitely a case for investigation.
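For a sense of where that occasional time could go, the 2.6.30 per-packet UDP path boils down to roughly the sketch below (heavily simplified; the helper names are made up). Candidates for the spikes would be the software checksum when the NIC hasn't verified it, spinning on the socket lock against the recvfrom() side, and the wakeup itself:

    /* Heavily simplified shape of __udp4_lib_rcv() and friends. */
    static int udp_rcv_sketch(struct sk_buff *skb)
    {
            struct sock *sk = lookup_udp_socket(skb); /* hash lookup (made up) */

            if (!sk) {
                    send_port_unreachable(skb);       /* made up */
                    return 0;
            }

            /* software checksum only if hardware didn't verify it */
            if (checksum_needed(skb) && verify_checksum(skb)) { /* made up */
                    kfree_skb(skb);
                    return 0;
            }

            bh_lock_sock(sk);                /* may spin against the reader */
            skb_queue_tail(&sk->sk_receive_queue, skb);
            sk->sk_data_ready(sk, skb->len); /* wake the blocked recvfrom() */
            bh_unlock_sock(sk);
            return 0;
    }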


UDP (kernel) -> recvfrom(user)

Finally the packet arrives in userspace (above), which is the source of the two peaks in total latency. The reason for this? Not sure; likely related to the blocking/signaling behaviour of the blocking recvfrom() call. A possible theory for the size and pitch of the peaks is the softirq timer frequency (300Hz default), where a quick test is to increase/decrease this frequency, rebuild the kernel, run the test and check the result.
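The userspace end of the measurement is the simple part; something like the sketch below (a blocking UDP socket timestamped with the rdtsc helper from the top of the post; the port number is made up):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Block in recvfrom() and timestamp the return; pairing this with a
     * stamp taken in the kernel's UDP code gives the histogram above. */
    int main(void)
    {
            char buf[2048];
            struct sockaddr_in addr;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(12345); /* made-up port */
            bind(fd, (struct sockaddr *)&addr, sizeof(addr));

            for (;;) {
                    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
                    uint64_t t = rdtsc(); /* helper shown earlier */
                    printf("%zd bytes @ %llu cycles\n",
                           n, (unsigned long long)t);
            }
    }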


Rx latency summary

It's difficult to summarize each module with a single number, as the stddev for each component can vary significantly; however the above is a rough guide for a non-NAPI configured driver and stack on 2.6.30.10. It's really hard to reproduce the exact numbers: just booting the stock Arch Linux distro kernel (which is the source of the .config file for this 2.6.30.10 kernel) shows a wildly different profile, and reloading the network driver too many times causes weirdness. Such is life when there's a metric ton of code running.

The line count of the C files in linux-2.6.30.10/net clocks in at around 912K LOC, and net/ipv4 at 128K LOC ... obviously Ethernet + IP/UDP + the Intel e1000e driver is a fraction of that, but even 20K LOC is a significant chunk of logic, making tuning at the microsecond level truly a fine art.

3 comments:

  1. Thanks for all your posts, very interesting! What tool do you use to produce such nice latency charts?

  2. Happy you find it interesting. Charts are produced using

    GLE (http://www.gle-graphics.org/)

    and

    rdtsc (http://en.wikipedia.org/wiki/Time_Stamp_Counter)

    and

    code scattered throughout the entire sw stack

  3. I am aware of RDTSC but never used it for counters, as GetSystemTimeAsFileTime was usually enough for my purposes (on Windows platform).
    Excellent work, thanks again!

