Sunday, 11 May 2014

PCIe latency

Ever wondered what the RTT is over PCI Express? Or you may be asking: wtf is PCIe RTT?


PCIe is quite similar to Ethernet in the sense that it's a packet bus, whereas PCI and PCI-X are the older, more traditional style of parallel bus. The oversimplified way to look at it is: inside your PC/server is a small Ethernet network with an MTU of 128B. I could rant about PCIe quite a bit, but I suggest an excellent write-up here to see how it all works.

By PCIe RTT I mean the time to ping/pong a packet from one EndPoint (think Ethernet IP address) to some other EndPoint and back again, the same as an ICMP ping to some IP address. How do you write such a test? ... with an FPGA of course!

Xilinx's Virtex 7 comes in a variety of SKUs with between 1 and 4 PCIe Gen3 interfaces, each 8 lanes wide @ 8 GT/sec (note GT == gigatransfers, not GBytes). The test is then simply the FPGA replying to an appropriately addressed PCIe TLP packet.
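To put the GT/sec figure in perspective, here is a rough sketch of what a Gen3 x8 link works out to in usable bits and bytes. The function names are made up for illustration, and the only overhead it accounts for is the line encoding (128b/130b for Gen3, 8b/10b for Gen1/Gen2); TLP headers and flow control eat further into the number, so treat it as an upper bound.

```c
#include <assert.h>

/* Hypothetical helper: link rate in Gbit/s after line-encoding overhead.
 * gt_per_s    raw per-lane rate in GT/s (8.0 for Gen3)
 * lanes       link width (8 for an x8 link)
 * enc_payload / enc_total   encoding ratio, e.g. 128/130 for Gen3 */
static double pcie_link_gbit_per_s(double gt_per_s, int lanes,
                                   int enc_payload, int enc_total)
{
    assert(lanes > 0 && enc_payload > 0 && enc_total >= enc_payload);
    return gt_per_s * lanes * enc_payload / enc_total;
}

/* Convert Gbit/s to GByte/s. */
static double gbit_to_gbyte(double gbit_per_s)
{
    return gbit_per_s / 8.0;
}
```

Calling pcie_link_gbit_per_s(8.0, 8, 128, 130) gives roughly 63 Gbit/s, a bit under 7.9 GByte/s per direction, before any TLP overhead.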

For the more visually inclined it looks like


Things to note: the i7 4771 is Intel's latest and greatest CPU, with a native PCIe Gen3 interface wired directly to the CPU pads/pins. Also note this is all "desktop" grade hardware, though the significance of this is debatable.


And what do the numbers look like? ..... Pretty shitty actually, surprisingly shitty in fact.
Above is a plot of the RTT in nanoseconds. The X axis is sample count, the Y axis is nanos. You can see it's ~850ns for the full round trip... quite high to be honest. Time to dig in and investigate wtf the time has gone - and yes, the card is in the correct PCIe slot!


Why so high? Here's a short list of ideas:
  •  V7 PCIe end point
  •  1337 ping code
  •  Intel PCIe EP
  •  Intel Memory Model
The easiest of these to check is my "1337 ping code", as the code is already in a development-style setup, so getting waves (waveforms) is straightforward.
 
Above is how a typical debug session goes: run some simulation, wait a bit and check out the waves. In the above pic you can see the clock ticking over every 4ns, and 2 (out of 4) of the interfaces on Xilinx's new PCIe IPCore: the m_axis_cq_* signals and the s_axis_rq_* signals. The CPU's register writes show up on the m_axis_cq_* interface, while our writes (to the CPU's DDR) are issued on the s_axis_rq_* interface. If you want to dig deeper check out the spec.

We can then count the number of cycles between receiving the CPU write (yellow line) and the first cycle of the "1337 ping code" write into CPU DDR (red line). That's ~16 cycles, and with the interface running at 250 MHz that means 4ns * 16 cycles = 64ns - not particularly great but workable. Meaning the search for the remaining ~800ns remains .... inconclusive.
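For anyone who wants to redo the arithmetic with their own cycle counts, here is a tiny sketch of the budget calculation. The helper names are made up for illustration; the inputs (16 cycles, 250 MHz, ~850ns RTT) are just the numbers quoted above.

```c
#include <assert.h>

/* Hypothetical helper: convert a cycle count at a given clock (in MHz)
 * into nanoseconds. One cycle at 250 MHz is 1000 / 250 = 4 ns. */
static double cycles_to_ns(unsigned cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}

/* Hypothetical helper: the part of a measured round trip that the
 * known contribution does not account for. */
static double unexplained_ns(double rtt_ns, double known_ns)
{
    return rtt_ns - known_ns;
}
```

With the numbers from this post, cycles_to_ns(16, 250.0) is 64ns and unexplained_ns(850.0, 64.0) is 786ns, i.e. the "~800ns" still unaccounted for.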

.... next up I'll investigate writing directly into the CPU's L2 cache from the FPGA, thus avoiding any shenanigans involving the CPU's DDR.

4 comments:

  1. I've not had much success writing directly to cache. I'm not sure that TLP Processing Hints are fully supported by current Intel chips - I couldn't achieve any measurable improvement at least, even brute-forcing every combination!

    You should be able to achieve ~500ns round trip though, even without writing directly to cache. How is your software mapping the memory?

  2. Thanks for the comment, good to hear ~ 500ns should be doable.

    SW mapping is uncached for BAR0/registers and cached for the pong dma target. Code is something like

    rdtsc();
    write_reg(KICK);
    while (dma_area[0] != ExpectedValue);
    rdtsc();

    Interesting that you've seen little change when using TPH. Intel made a lot of noise about this; maybe it's just noise.

  3. I found the noise about TPH but not much substance - post if you find anything out.

    I'm running similar benchmarks, uncached register kick and then spinning on DMA-able memory. Total time isn't significantly more than a simple blocking uncached read. What's your RTT for that?

    I'm on Altera Stratix V but I'd be surprised if there was much difference between the PCIe HardIPs of Altera/Xilinx...

    Replies
    1. Most interesting, RTT for a simple uncached register read is ~ 480-500ns. In the same ballpark as your number.

      Certainly narrows down the search. Xilinx's new V7 PCIe IPCore kinda sucks now as it's no longer a raw TLP interface. It's now a pseudo-normalized TLP interface with separate paths for CPU-initiated and device-initiated transfers.

      Thanks!
