If we profile the Rx side, it's basically the cost of a register read + some cold DDR fetch + a bit of fluffing around, so the next logical step is to ditch all register reads. It's quite easy, as the MAC will DMA the descriptor status and packet contents directly into DDR... so let's just poll the Rx descriptor status instead of reading the Rx head register.
128B TCP round trip no register reads
128B TCP round trip register reads
128B TCP round trip metal -> linux (with reg reads)
As you can see, we're now a respectable 4-5,000ns faster than the Linux stack! However, care must be taken with this register-free approach, as I'm not certain what the memory ordering rules are here, e.g. whether PCIe/SB writes are committed in order or can be scheduled out of order. It's highly unlikely that by the time you've read the Rx descriptor status the payload data has not reached DDR, but if you've got billions of dollars running through it you *really* want to be certain.
Finally, the bulk of the speedup comes from Machine B's long-ass register read times, so the full NIC-to-NIC time on Machine B for a 128B TCP packet is...
Machine B, NIC(CPU Rx) -> NIC(CPU Tx) latency
... around 410ns, and that's with SW checksums. To put that in perspective, my shitty FIX FAST (ArcaBook compact) SW decoder runs at around 140ns/packet assuming it's hot in the cache. So we can clearly speed up TCP loopback code, but what's the point when hw latency is the bottleneck?
Obviously the wire-to-wire time is a few microseconds... e.g. it's going to take 1,000+ ns just for the MAC to write into DDR, and conversely for a CPU write to hit the MAC... but it underscores how, at this level, the problem is all hardware/topology, as hw latency is an order of magnitude larger than sw... no fun... so time to get back to my day job.