Sunday, 11 May 2014

PCIe latency

Ever wondered what the RTT is over PCI Express? Or you may be asking, wtf is PCIe RTT?


PCIe is quite similar to ethernet in the sense that it's a packet bus, whereas PCI and PCI-X are the older, more traditional style parallel buses. The oversimplified way to look at it is: inside your PC/server is a small ethernet network with an MTU of 128B. Could rant about PCIe quite a bit, but suggest an excellent write-up here to see how it all works.

By PCIe RTT I mean the time to ping/pong a packet from one EndPoint (think ethernet IP) to some other EndPoint and back again - the same as an ICMP ping to some IP address. How do you write such a test? ... with an fpga of course!

Xilinx's Virtex 7 comes in a variety of SKUs with between 1-4 PCIe Gen3 interfaces, each 8 lanes wide @ 8 GT/sec (note GT == GigaTransfers, not GBytes). Thus the test is simply replying to an appropriately addressed PCIe TLP packet.
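Software side, the host end of such a ping might look like the sketch below. To be clear this is all assumption - the actual code lives in the FPGA and isn't shown here. bar is a mmap()ed BAR of the endpoint, pong a DMA-visible host buffer the FPGA is presumed to echo into.

        #include <stdint.h>
        #include <x86intrin.h>          /* __rdtsc, _mm_pause */

        static uint64_t ping(volatile uint32_t *bar,
                             volatile uint32_t *pong, uint32_t seq)
        {
            *pong = 0;
            uint64_t t0 = __rdtsc();
            *bar = seq;                 /* posted write -> TLP to the FPGA */
            while (*pong != seq)        /* spin until the FPGA's DMA write */
                _mm_pause();            /* lands back in host memory       */
            uint64_t t1 = __rdtsc();
            return t1 - t0;             /* RTT in TSC ticks; divide by the
                                           TSC frequency for nanoseconds   */
        }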

For the more visually inclined it looks like


Things to note: the i7 4771 is Intel's latest and greatest CPU with a native PCIe Gen3 interface wired directly to the CPU pads/pins. Also note this is all "desktop" grade hardware tho... the significance of this is debatable.


And what do the numbers look like? ..... Pretty shitty actually, surprisingly shitty in fact.
Above is a plot of the RTT in nanoseconds: X axis is sample count, Y axis is nanos. You can see it's ~850ns for the full round trip... quite high to be honest. Time to dig in and investigate wtf the time has gone - and yes, the card is in the correct PCIe slot!


Why so high? Here's a short list of suspects:
  •  V7 PCIe end point
  •  1337 ping code
  •  Intel PCIe EP
  •  Intel Memory Model
The easiest of these to check is my "1337 ping code", as the code is already in a development-style setup, thus getting waves (waveforms) is straightforward.
 
Above is how a typical debug session goes: run some simulation, wait a bit and check out the waves. In the above pic you can see the clock ticking over every 4ns, and 2 (out of 4) of Xilinx's new PCIe IP core interfaces - the m_axis_cq_* and s_axis_rq_* signals. The CPU's register writes show up on the m_axis_cq_* interface, while our writes (to the CPU's DDR) are issued on the s_axis_rq_* interface. If you want to dig deeper check out the spec.

We can then count the number of cycles between receiving the CPU write (yellow line) and the first cycle of the "1337 ping code" write into CPU DDR (red line), which is ~16 cycles. With the interface running at 250MHz, that's 16 cycles * 4ns = 64ns - not particularly great but workable. Meaning the search for the remaining ~800ns remains... inconclusive.

.... next up, will investigate writing directly into the CPU's L2 cache from the fpga, thus avoiding any shenanigans involving the CPU's DDR.

Sunday, 27 April 2014

Market Data Backups

Been doing some housekeeping, part of which is cleaning up and standardizing HDD arrays. One of the sucky things about RAID is you need each physical disk to be the same size, otherwise the extra space just sits there empty. As such, needed to shuffle a bunch of data around and realized it's time to put all my ITCH data into cold storage - have not touched it for quite some time.

All up it's about 1.5TB worth compressed; decompressed, probably 5-6TB or so. There are a few options out there for this level of data backup:

1) disconnected HDD
2) cloud backup service
3) burn optical disks

Option 1) would be the most obvious choice: HDDs are cheap, transfers are fast. In theory HDDs are less robust than optical media due to moving parts (bearing failure) vs no moving parts at all.

Option 2) just doesn't work for TBs of data. Amazon Glacier costs ~$0.01 / GB / month, which isn't so bad; the killer is the bandwidth cost of $0.20 / GB to retrieve the data - for this dataset that's ~$15/month to store but ~$300 to pull it all back down, pushing total cost significantly above the cost of HDD or optical. Guess if you factor redundancy into the HDD or optical storage size, Amazon isn't that bad. Still, just not comfortable with a fixed monthly cost vs a one-time expenditure.
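Running this dataset's numbers through those prices (a quick sanity check, nothing more):

        #include <stdio.h>

        int main(void)
        {
            double gb       = 1500.0;      /* ~1.5TB compressed            */
            double store_mo = gb * 0.01;   /* $0.01 / GB / month           */
            double restore  = gb * 0.20;   /* $0.20 / GB to pull back down */
            printf("store $%.0f/month, 1 year $%.0f, year + 1 restore $%.0f\n",
                   store_mo, 12 * store_mo, 12 * store_mo + restore);
            return 0;                      /* -> $15/month, $180, $480     */
        }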

Which leaves Option 3) optical, meaning BluRay 25GB & 50GB disks. The media cost is quite a lot less than HDD cost -> 50 disks @ 25GB each is around $30 == 1.25TB. But it's a complete pain in the ass to burn that many. Ended up writing a script to almost automate the whole copy, burn, verify cycle, with a single disk taking around 3H in total.
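The script itself isn't reproduced here, but the loop body boils down to three commands. A rough sketch of the idea, wrapped in C for this post - tool names, paths and the device are all assumptions, not what I actually ran:

        #include <stdio.h>
        #include <stdlib.h>

        static int run(const char *cmd)     /* tiny helper: echo then exec */
        {
            printf("+ %s\n", cmd);
            return system(cmd);
        }

        int main(void)
        {
            /* 1. master an ISO from the staged ~25GB directory */
            if (run("mkisofs -R -J -o /tmp/batch.iso /data/itch/batch0"))
                return 1;
            /* 2. burn it */
            if (run("growisofs -dvd-compat -Z /dev/sr0=/tmp/batch.iso"))
                return 1;
            /* 3. verify: read back exactly the ISO's size, compare */
            if (run("dd if=/dev/sr0 bs=2048 "
                    "count=$(( $(stat -c%s /tmp/batch.iso) / 2048 )) "
                    "2>/dev/null | cmp - /tmp/batch.iso"))
                return 1;
            puts("disk verified - next!");
            return 0;
        }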

In total it's taken about a month to burn the dataset. Some days would get 5-6 disks done, others 1 or 0. But there's some sort of peace of mind I get from having data backed up in ROM, where only sunlight and a hammer can delete the files.






Wednesday, 22 January 2014

HDD failures


[image: Backblaze drive failure rates by manufacturer - http://blog.backblaze.com/wp-content/uploads/2014/01/blog-fail-drives-manufacture.jpg]

Hot off the press with some excellent numbers on real-world disk failure rates. As someone who's consuming and storing a substantial number of TBs, this kind of data is invaluable. The full post can be found here.

Am using WD green drives coupled with SSDs on ZFS, thus you get a lot of bang for buck with large cheap disks and a fast cache/working set (SSD). So far have only lost 1 disk, and luckily it failed during burn-in.

Keep in mind this data is for consumer-grade disks, not your typical HP/DELL/IBM crazy-ass expensive servers.

In short: Seagate Barracuda disks suck - beware.

Sunday, 19 January 2014

Ultra Advanced FPGA Development


This was just too amusing to pass on. Having a cooling problem with a 10G fpga board, which required some hardcore engineering skillz to solve. Essentially the board has no fan and there is no case, resulting in rather poor airflow and a toasty chip. To make it worse there was nothing to rest the fan on, thus the below highly engineered solution was taken to the business for approval.




Full purchase order is as follows:

- x1 12V fan
- x2 allen keys (one required to be quite long)
- x1 twisty tie
- x1 TAPE! *this is special thermal tape, forget the name, but it happily survives 1-500 deg C*

The truly novel part of this invention is the unique application of a twist tie to secure the allen key to the heatsink, via the heatsink's own design.

NOTE: technically tape is not required but psychologically, tape is always required.

Sunday, 29 December 2013

diworsification



[image: christmas tree made from a blue electronic circuit - http://digitalleaders.co.uk/wp-content/uploads/2013/01/10859543-christmas-tree-from-digital-electronic-blue-circuit-and-lights-300x300.jpg]
Alas it's that time of the year again and dam... so few posts this year, many apologies. It's been a good year financially, reached a personally satisfying PnL across all revenue streams, but holy crap it's been hard: running 3 gigs at the same time, and as if that's not perverse enough, 2 of them are on completely inverted time zones... literally, JST (Japan Standard Time) = EST + (almost) 12H... AM == PM. Luckily everyone involved is understanding.

Have always had some weird sleeping patterns, starting in high school with caffeine-fueled, goa-psy-trance-blaring, double-all-nighter code/hack/phreak sessions: starting Saturday afternoon, through Sat night, through Sunday, all the way through Sunday night to Monday morning, then take a shower and go to school. No doubt school was pretty useless at that point, but meh. Sadly the body at 15yo vs 30yo are two very different things, with mine creaking, cracking and complaining all goddam year - holy crap I'm getting old!

One of the best lessons this year was how different the end-of-year PnL looks with multiple revenue streams. Non-(time)-linear PnL is such a beautiful thing. Thus CY2014's first goal is replacing the consulting revenue with product revenue. The consulting group I worked with are a great set of people with minimum bullshit, but at the end of the day, once the billable hours stop so does the cash, keeping you on this perpetual trade of time for money. The plan is to develop a product targeting the non-financial sector and let it grow/shrink organically, or just be happy in a small niche. Ultimately, the hardest problem is finding a niche that returns well but is not a huge time sink.

On the trading front, need to push my comfort zone, building out a different class of HF strategies. No doubt will lose money for a while, but so far my trading experience is that it's a lot like learning to ride a bike - you cannot teach it, you just need to do it, learn from the mistakes and gain intuition. But here's the trick: the experience you gain isn't so much from watching the market - you can watch it all day long and get nowhere. The most important part is competing with other players: analyzing them, sizing them and of course fighting with them. So while they (and everyone else) are beating the absolute shit out of you, you'll realize in that pool of your own blood some subtle perspective of the interaction that completely changes your game.

Happy Christmas & New Year!

Friday, 15 March 2013

Tale of two sockets




Once upon a time in a far away place there was a socket... 

The term socket has different interpretations: to those on the left, in the network world, it usually refers to Berkeley sockets, aka socket(AF_INET, ...) send/recv.

To those on the right, in LSI land, it means a place where you plug in an ASIC. Today we're talking about the latter: the space where you plug in (usually) a CPU.

Recently faced an unusual timing problem - a bit of code was taking negative 1 to 10 usecs. Under other circumstances would have been delighted the code was running so fast, yet nature being the stubborn SOB that it is prohibits such things... at least outside the Physics department.

It got me thinking, questioning every aspect of the code and hardware with one nagging question popping up.

Is the cycle counter between two sockets synchronized?

It's well known and documented that in the post-Nehalem age the Intel cycle counter is rock solid: invariant to power states, consistent between cores, and also immune to turbo boost - the latter being a pain in the ass if you really want to count cycles. Why does it always tick over at the frequency listed on the box? Seems the Nehalem designers decided rdtsc & friends should be based on a clock in the creatively named "UnCore", which sits way out near the IO logic, thus centralized and the same for all cores.
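Don't take my word for the invariant bit - CPUID advertises it, leaf 0x80000007, EDX bit 8. A minimal check (GCC's cpuid.h):

        #include <cpuid.h>
        #include <stdio.h>

        int main(void)
        {
            unsigned a = 0, b = 0, c = 0, d = 0;
            /* leaf 0x80000007, EDX bit 8 = invariant TSC */
            __get_cpuid(0x80000007, &a, &b, &c, &d);
            printf("invariant TSC: %s\n", (d >> 8) & 1 ? "yes" : "no");
            return 0;
        }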

rdtsc on a single-socket machine has been a trusted friend since the good ol' Pentium 90 days, but never tested its behaviour between CPU sockets. Is it synchronized between two CPU sockets? The short answer is yes, it is very well synchronized, as mentioned in various places. However... being the skeptical person I am - don't believe writings on teh interwebz (such as this blog!) - and so... a test.

The test is simple: run 2 hardware threads (HWT) on 2 different sockets. Each HWT will update a shared cycle counter in memory. The expected result is that the local HWT cycle counter should *always* be greater than the memory value.

Why ? example case 1

Cycle |         HWT 0     |  HWT 1
   0  |  sample           |
   1  |  write to memory  |
   2  |                   |    sample
   3  |                   | write to memory

Or the perverse edge case

Cycle |         HWT 0     |  HWT 1
   0  |  sample           |      sample
   1  |  write to memory  |   write to memory
   2  |                   |

**1 - in the smallest font possible, hoping no one will read this... the test fails when the 64b cycle counter overflows
**2 - yes I'm completely delusional to think the above is anything close to the voodoo magic that goes on inside a real Intel CPU
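Getting the two HWTs onto different sockets is just thread affinity. A minimal sketch (Linux/GCC; which core lives on which socket is an assumption - it depends on the box's topology):

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>

        /* pin a thread to a single core; pick one core per socket,
           e.g. core 0 on socket 0 and core 8 on socket 1 (assumed) */
        static void pin(pthread_t t, int cpu)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            pthread_setaffinity_np(t, sizeof(set), &set);
        }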

Presuming the cycle counters are synchronized, then every time we sample the cycle counter, the counter will be higher than what's written in memory - because what's in memory is a past cycle count, which by definition is less than the current count.

in code (g_tsc is the shared memory value, updated by both HWTs):

        u64 memory_tsc = g_tsc;            /* read the other HWT's last sample  */
        u64 tsc        = rdtsc();          /* sample our own cycle counter      */
        s64 dtsc       = tsc - memory_tsc;
        assert(dtsc >= 0);                 /* we must never be behind the past  */
        g_tsc = tsc;                       /* publish our sample                */

This works if the two cycle counters (one for HWT 0, one for HWT 1) are synchronized and fails otherwise. There is a problem tho... if we run this as is, it fails.

There are 3 (possibly more?) reasons why:

1) cycle counters are not perfectly synchronized
2) x86 micro architecture is re-ordering the memory operations
3) compiler re-ordered the load.

Checked the asm, so 3) is not true; thus suspect 2), as x86 is notorious for its (very) relaxed memory ordering. So we modify the above slightly by adding an lfence instruction. What this does is serialize all memory loads (but not stores) with respect to the current instruction fetch stream - creating a barrier of sorts.
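A sketch of the fenced version - the exact placement isn't shown above, this is one plausible arrangement, using GCC's intrinsics:

        #include <x86intrin.h>         /* _mm_lfence, __rdtsc */

        u64 memory_tsc = g_tsc;        /* load the other socket's sample   */
        _mm_lfence();                  /* prior loads complete before ...  */
        u64 tsc = __rdtsc();           /* ... we read the cycle counter    */
        s64 dtsc = tsc - memory_tsc;
        assert(dtsc >= 0);
        g_tsc = tsc;                   /* publish our sample               */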

After this is added the program runs perfectly, thus conclude the cycle counter between sockets is perfectly synchronized, or at least closely synchronized.


Now we know they're closely synchronized, the question is how close, and does it explain my negative 10usec. To do this we run a histogram of the delta between the cycle counter and the memory value, which shows a range of 100-200 cycles. At 3GHz that's at most ~66.7ns, likely the QPI cost between sockets. All in all that's pretty tight.
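The histogram itself is just a bucket per cycle with the tail clamped - a quick sketch, reusing the names from above:

        static u64 hist[256];              /* one bucket per cycle of delta */

        static void record(s64 dtsc)       /* call after dtsc is computed   */
        {
            u64 d = dtsc < 0 ? 0 : (u64)dtsc;
            hist[d < 255 ? d : 255]++;     /* clamp the long tail           */
        }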

The downside: this was classic tunnel vision and not the cause of my negative latency number. The real cause was something way up the stack and far too embarrassing to write about here!

Saturday, 16 February 2013

The Professional Trader







The definition of a professional trader: "professional" meaning a method to support oneself financially, and "trader" describing the type of activity - thus a professional trader makes their living in the markets.

According to our most trusted news site, CNN, the average salary for a freshly minted CS grad is $60,872, which means:



- $5,072 / month
- $1,268 / week
-   $253 / biz day

Now if we break that $253 / day into HF-style trades it looks like:

@ 1 share / day        -> $253 net gain / share
@ 10 shares / day      -> $25.3 net gain / share
@ 100 shares / day     -> $2.53 net gain / share
@ 1,000 shares / day   -> $0.253 net gain / share
@ 10,000 shares / day  -> $0.0253 net gain / share
@ 100,000 shares / day -> $0.00253 net gain / share

At 100K shares/day with a *net* pnl per share of 0.253 cents (less than a penny) we could make it work. At 10K shares/day with 2.53 cents/share profit, also in the realm of feasibility. Note that 25 cents/share profit on every share is possible, but think you're pushing it to expect that consistently day-in-day-out at high volume. Remember this is per share, not per trade.

Now let's say we choose 10K shares @ 2.53 cents/share net pnl, e.g. we're trading the spread + a tick or two, and our edge is say 10%: meaning 60% of our trades make that 2.53 cents, and 40% of the trades lose 2.53 cents. NOTE: in reality that kind of equally priced distribution, abs(win) == abs(loss), is highly unlikely.

The result is we actually need to trade

   $253 = k * (0.6 * 0.0253 - 0.4 * 0.0253) where k ≈ 50K, so our 10K shares is now 50K shares.

Counting in trades: let's say we're trading lots of 100 shares, meaning we're doing 500 trades/day. And let's say the market is open 6H/day, which means 500 equally spaced trades is 6H * 60min * 60sec / 500 trades = 43.2 sec, i.e. one trade every 43.2 secs for 6 hours. But actually it's half that, because you need to enter and exit the position, so we're doing 1 trade every 21.6 seconds.
.
.
.


It starts to put this HF trading thing in perspective -> 1 trade every 21.6 secs, making 2.5 cents/share on average, to make $250/day or $60K/year..... a nice healthy dose of reality.
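For the skeptical, the arithmetic above in code form (numbers straight from the worked example):

        #include <stdio.h>

        int main(void)
        {
            double target = 253.0;                  /* net $/day wanted        */
            double pps    = 0.0253;                 /* $/share, win == loss    */
            double edge   = 0.6 * pps - 0.4 * pps;  /* $0.00506 expected/share */
            double shares = target / edge;          /* ~50,000 shares/day      */
            double trades = shares / 100.0;         /* 100-share lots -> ~500  */
            double gap    = 6.0 * 3600.0 / trades;  /* ~43.2s between trades   */
            printf("%.0f shares, %.0f trades, one every %.1fs (%.1fs with exits)\n",
                   shares, trades, gap, gap / 2.0);
            return 0;
        }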

What I'm doing now is diversifying my income strategies. Some of this is trading strategies, some unrelated, and others quite frankly you just would not believe. With 20/20 hindsight there are so many things I would have done differently, but easily the most important would be building non-trading income streams that cover your bills when the trading hits a rough patch - because when you're starting out it's just one endless bumpy rough ride.