hacking NASDAQ @ 500 FPS: 2013

Sunday 29 December 2013

diworsification


Alas, it's that time of the year again and damn... so few posts this year, many apologies. It's been a good year financially, reaching a personally satisfying PnL across all revenue streams, but holy crap it's been hard. Running 3 gigs at the same time wasn't perverse enough, so I made sure 2 of them are on completely inverted time zones... literally, JST (Japan Standard Time) = EST + 14H, close enough to 12H that AM == PM. Luckily everyone involved is understanding.

Have always had some weird sleeping patterns, starting in high school when I would do caffeine-fueled, goa psytrance blaring, double all-nighter code/hack/phreak sessions: starting Saturday afternoon, through Saturday night, through Sunday, all the way through Sunday night into Monday morning, then take a shower and go to school. No doubt school is pretty useless at that point but meh. Sadly the body at 15yo vs 30yo are two very different things, with mine creaking, cracking and complaining all goddamn year. Holy crap I'm getting old!

One of the best lessons this year was how different the end-of-year PnL looks with multiple revenue streams. Non-(time)-linear PnL is such a beautiful thing. Thus CY2014's first goal is replacing the consulting revenue with product revenue. The consulting group I worked with is a great set of people with minimum bullshit, but at the end of the day, once the billable hours stop, so does the cash, keeping you on this perpetual trade of time for money. The plan is to develop a product targeting the non-financial sector and let it grow / shrink organically, or just be happy in a small niche. Ultimately, the hardest problem is finding a niche that returns well but is not a huge time sink.

On the trading front I need to push my comfort zone, building out a different class of HF strategies. No doubt I will lose money for a while, but so far my trading experience is that it's a lot like learning to ride a bike: you cannot teach it, you just need to do it, learn from the mistakes and gain intuition. But here's the trick, the experience you gain isn't so much from watching the market, you can watch it all day long and get nowhere. The most important part is competing with other players, analyzing them, sizing them up and of course fighting with them. So while they (and everyone else) are beating the absolute shit out of you, you'll realize, in that pool of your own blood, some subtle perspective of the interaction that completely changes your game.

Happy Christmas & New Year!

Friday 15 March 2013

Tale of two sockets

Once upon a time in a far away place there was a socket...

The term socket has different interpretations. To those on the left, in the network world, it usually refers to Berkeley sockets, aka socket(AF_INET, _ ... ) send/recv.

To those on the right, in LSI land, it means a place where you plug in an ASIC. Today we're talking about the latter, the space where you plug in (usually) a CPU.

Recently faced an unusual timing problem: a bit of code was taking negative 1 to 10 usecs. Under other circumstances I would have been delighted the code was running so fast, yet nature, being the stubborn SOB that it is, prohibits such things... at least outside the Physics department.

It got me thinking, questioning every aspect of the code and hardware with one nagging question popping up.

Is the cycle counter between two sockets synchronized?

It's well known and documented that in a post-Nehalem age the Intel cycle counter is rock solid: invariant to power states, consistent between cores and also immune to turbo boost, the latter being a pain in the ass if you really want to count cycles. Why does it always tick over at the frequency listed on the box? It seems the Nehalem designers at Intel decided rdtsc & friends should be based on a clock in the creatively named "UnCore", which sits way out near the IO logic, and is thus centralized and the same for all cores.

rdtsc on a single CPU socket machine has been a trusted friend since the good ol' Pentium 90 days, but I had never tested its behaviour between CPU sockets. So is it synchronized between two CPU sockets? The short answer is yes, it is very well synchronized between CPU sockets, as mentioned in various places. However... being the skeptical person I am, I don't believe writings on teh itnerwebz (such as this blog!) and so..... a test.

The test is simple: run 2 hardware threads (HWT) on 2 different sockets. Each HWT samples its cycle counter and writes it to a shared cycle counter in memory. The expected result is that the local HWT cycle counter should *always* be no less than the memory value.

Why? Example case 1:

Cycle | HWT 0           | HWT 1
0     | sample          |
1     | write to memory |
2     |                 | sample
3     |                 | write to memory

Or the perverse edge case:

Cycle | HWT 0           | HWT 1
0     | sample          | sample
1     | write to memory | write to memory
2     |                 |

**1 - in the smallest font possible hoping no one will read this... the test fails when the 64b cycle counter overflows

**2 - yes I'm completely delusional to think the above is anything close to the voodoo magic that goes on inside a real Intel CPU

Presuming the cycle counters are synchronized, every time we sample the cycle counter it will be at least what's written in memory, because what's in memory is a past cycle count, which by definition is no greater than the current count.

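in code (g_tsc is the shared memory value):

    u64 memory_tsc = g_tsc;              // last cycle count published to shared memory
    u64 tsc        = rdtsc();            // this HWT's current cycle count
    s64 dtsc       = tsc - memory_tsc;   // signed delta, negative means memory is ahead of us
    assert(dtsc >= 0);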

This works if the two cycle counters (one for HWT 0, one for HWT 1) are synchronized and fails otherwise. There is a problem tho... if we run this as is, it fails.

There are 3 (possibly more?) reasons why:

1) the cycle counters are not perfectly synchronized
2) the x86 microarchitecture is re-ordering the memory operations
3) the compiler re-ordered the load

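Checked the asm and 3) is not true, thus suspect 2). x86 memory ordering is actually fairly strict for plain loads and stores, but rdtsc is not a serializing instruction, so the CPU is free to execute it before the preceding load of g_tsc has completed. So we modify the above slightly by adding an lfence instruction between the load and the rdtsc. lfence does not execute until all prior instructions (loads included, though not the global visibility of stores) have completed locally, and no later instruction starts until it completes, creating a barrier of sorts.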

After this is added the program runs perfectly, thus conclude the cycle counters between sockets are synchronized, or at least closely enough that this test cannot tell the difference.
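A minimal sketch of the fenced sample described above, assuming GCC or Clang on x86 with the `__rdtsc()` and `_mm_lfence()` intrinsics from `<x86intrin.h>`. The names `read_cycles`, `load_fence`, `publish`, `sample_delta` and `min_delta` are made up for illustration and are not the original test code; the non-x86 branch is there only so the sketch builds elsewhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_cycles(void) { return __rdtsc(); }
static inline void load_fence(void) { _mm_lfence(); }
#else
#include <time.h>
/* Fallback so the sketch builds off x86: nanoseconds, not cycles. */
static inline uint64_t read_cycles(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
static inline void load_fence(void) { atomic_thread_fence(memory_order_seq_cst); }
#endif

/* Shared cycle count, written by every hardware thread (g_tsc in the text). */
static _Atomic uint64_t g_tsc;

/* Publish the local cycle counter to shared memory. */
static void publish(void) {
    atomic_store_explicit(&g_tsc, read_cycles(), memory_order_release);
}

/* Load the shared value, fence, then read the local counter. */
static int64_t sample_delta(void) {
    uint64_t memory_tsc = atomic_load_explicit(&g_tsc, memory_order_acquire);
    load_fence();                 /* the load completes before the counter is read */
    uint64_t tsc = read_cycles();
    return (int64_t)(tsc - memory_tsc);
}

/* Body of one hardware thread: sample, publish, track the smallest delta. */
static int64_t min_delta(int iterations) {
    int64_t lo = INT64_MAX;
    for (int i = 0; i < iterations; i++) {
        int64_t d = sample_delta();
        if (d < lo) lo = d;
        publish();
    }
    return lo;
}
```

Running `min_delta()` in two threads, each pinned to a core on a different socket (for example with `pthread_setaffinity_np` on Linux), and checking that the result is never negative is one way to reproduce the test. Dropping the `load_fence()` call gives the unfenced variant that, per the post, trips the assert.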

Now we know they're closely synchronized, the question is how close, and does it explain my negative 10usec? To find out we run a histogram on the delta between the cycle counter and the memory value, which shows a range of 100-200 cycles. At 3GHz that is at most ~66.7nsec, likely the QPI cost between sockets. All in all that's pretty tight, and nowhere near 10usec.

The downside: this was classic tunnel vision and not the cause of my negative latency number. The real cause was something way up the stack and far too embarrassing to write about here!
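As a sanity check on the arithmetic: cycles divided by the frequency in GHz gives nanoseconds, so 200 cycles at 3GHz is about 66.7ns. The sketch below shows that conversion plus one way such a histogram could be bucketed. The bucket count, bucket width and function names are assumptions, not taken from the original test.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUCKETS      16   /* hypothetical: 16 buckets ...          */
#define BUCKET_WIDTH 25   /* ... of 25 cycles each, covering 0-399 */

/* Convert a cycle count to nanoseconds for a given TSC frequency in GHz. */
static double cycles_to_ns(uint64_t cycles, double ghz) {
    return (double)cycles / ghz;
}

/* Count each non-negative delta into its bucket; the last bucket also
   collects everything past the end of the range. */
static void histogram(const int64_t *delta, int n, uint32_t *hist) {
    memset(hist, 0, BUCKETS * sizeof hist[0]);
    for (int i = 0; i < n; i++) {
        if (delta[i] < 0) continue;          /* negative = sync failure, skip */
        int64_t b = delta[i] / BUCKET_WIDTH;
        if (b >= BUCKETS) b = BUCKETS - 1;
        hist[b]++;
    }
}
```

Feeding it the deltas collected by each hardware thread and printing the non-empty buckets is enough to see where the bulk of the samples sit.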

Saturday 16 February 2013

The Professional Trader

The definition of a professional trader: "professional" means it is how you support yourself financially, and "trader" describes the type of activity. Thus a professional trader makes their living in the markets.

According to our most trusted news site, CNN, the average salary for a freshly minted CS grad is $60,872, which means:

- $5,072 / month
- $1,268 / week (calling it 4 weeks / month)
- $253 / biz day

Now if we break that $253 / day into HF style trades it looks like:

@ 1 share / day -> $253 net gain / share
@ 10 shares / day -> $25.3 net gain / share
@ 100 shares / day -> $2.53 net gain / share
@ 1,000 shares / day -> $0.253 net gain / share
@ 10,000 shares / day -> $0.0253 net gain / share
@ 100,000 shares / day -> $0.00253 net gain / share

At 100K shares / day with a *net* pnl per share of 0.253 cents (about a quarter of a penny) we could make it work. At 10K shares / day with 2.53 cents / share profit we are also in the realm of feasibility. Note that 25 cents / share profit is possible, but I think you're pushing it to expect that consistently, day in day out, in high volume. Remember this is per share, not per trade.
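The breakdown above is easy to reproduce. This is a small sketch, assuming the same rough split the numbers imply (12 months / year, 4 weeks / month, 5 business days / week, rounding down to whole dollars at each step); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Daily target in whole dollars: 12 months / year, 4 weeks / month,
   5 business days / week, truncating at each step. */
static int daily_target(int salary) {
    int month = salary / 12;
    int week  = month / 4;
    return week / 5;
}

/* Net gain needed per share to hit the daily target at a given volume. */
static double gain_per_share(int target, long shares_per_day) {
    return (double)target / (double)shares_per_day;
}

/* Print the break-even table for 1 .. 100,000 shares / day. */
static void print_table(int salary) {
    int target = daily_target(salary);
    for (long shares = 1; shares <= 100000; shares *= 10)
        printf("@ %6ld shares / day -> $%.5f net gain / share\n",
               shares, gain_per_share(target, shares));
}
```

Calling `print_table(60872)` walks the same six volumes as the list, so plugging in a different salary shows how the required per-share edge moves.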

Now let's say we choose 10K shares @ 2.53 cents / share net pnl, e.g. we're trading the spread + a tick or two, and our edge is say 10% over a coin flip. Meaning 60% of our trades make that 2.53 cents, and 40% of the trades lose 2.53 cents. NOTE: in reality that kind of equally sized distribution, abs(win) == abs(loss), is highly unlikely.

The result is we actually need to trade

$253 = k * (0.6 * 0.0253 - 0.4 * 0.0253) = k * 0.00506, where k = 50K, so our 10K shares is now 50K shares.

Counting in trades, let's say we're trading lots of 100 shares, meaning we're doing 500 trades / day. And let's say the market is open 6H / day, which means 500 trades equally spaced is 6H * 60min * 60sec / 500 trades = 43.2sec. That's one trade every 43.2secs for 6 hours, but actually it's half that because you need to enter/exit the position, so we're doing 1 trade every 21.6 seconds.

It starts to put this HF trading thing in perspective -> 1 trade every 21.6secs that makes 2.5 cents / share on average, to make $250 / day or $60K / year..... a nice healthy dose of reality.

What I'm doing now is diversifying my income strategies. Some of this is trading strategies, some unrelated, and others quite frankly you just would not believe. Having 20/20 hindsight there are so many things I would have done differently, but easily the most important would be to build non-trading income strategies that cover your bills when the trading hits a rough patch, because when you're starting out it's just one endless bumpy rough ride.
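The volume and spacing arithmetic can be sketched as two small functions. This assumes, as the text does, that every win and every loss is the same size per share and that trades are spread evenly over the session; the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Shares needed per day so that wins minus losses add up to the target.
   Assumes every win and every loss has the same size per share. */
static double shares_needed(double target, double win_rate, double pnl_per_share) {
    double expected = win_rate * pnl_per_share
                    - (1.0 - win_rate) * pnl_per_share;
    return target / expected;
}

/* Seconds between orders: lots of `lot` shares spread evenly over the
   session, halved because each position needs an entry and an exit. */
static double seconds_per_order(double shares, double lot, double session_hours) {
    double trades = shares / lot;
    return session_hours * 3600.0 / trades / 2.0;
}
```

With a $253 target, a 60% win rate and 2.53 cents per share, `shares_needed` comes out at roughly 50K shares, and `seconds_per_order(50000, 100, 6)` gives the 21.6 seconds quoted above. Nudging the win rate towards 50% shows how quickly the required volume blows up.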

Thursday 3 January 2013

Round 2

Surprising how time flies, it's been 6 months since the end of Round 1, with Round 2 closing now. It's been one hell of a lot of fun, albeit soul-crushingly painful at times. Yet on the whole 2012 has been a good year - not much pnl but gained a ton of experience.

Have been pretty quiet about what I've been up to, but unfortunately the current strategies do not have enough juice to make it worth continuing. It sucks ass but that's reality, thus need to re-group yet again, pivot and soldier forward.

Here's the gross PnL over the last 6 months, real exchange, real trades, real money.

Looks nice, grossed ~$100k+ in 6 months, even impressive unless you trade HF yourself. In which case you notice the GROSS part and know how badly you get screwed by broker/exchange fees & tax. Read: the net pnl is only a fraction of this. The other realization is that for a HF strategy, over half a year, this kind of gross pnl is quite frankly utterly shit.

But wait, I hear, it's a HF strategy, improve the latency! The latency man! The latency....... bake it into an FPGA, build it for some network processor, hell go build that liquid nitrogen cooled 5GHz overclocked monster machine and all your problems are solved... or maybe not.

Sadly better latency will not change anything.

So... show me the numbers! Let's measure the Tick2Trade register-to-register time, which means from the nano(pico)second when the NIC's register says there is a packet ready, to the nano(pico)second you bang the NIC's register to say kick this packet. Visually it looks like:

When people talk about latency, most are using the socket-to-socket number. If you have lots of cash to burn, then you can purchase a Corvil and use the wire-to-wire number, the real absolute latency. We didn't have that choice, so the register-to-register number gives a close approximation, as the latency added by the HW is fairly consistent (at the microsecond level at least).

Without further latency lol... here's the register-to-register latency, real trades, real money, real exchange.

Above plot is the histogram of all trades over the last 6 months, with nanoseconds on the x-axis. As mentioned above, this is the register-to-register plot, so add 1usec (Rx) + 1usec (Tx) for PCIe transfer + NIC ASIC shenanigans + 10G SerDes/PHY to get the wire-to-wire latency. As you can see the median internal latency is ~4.5usec, putting it at under 10usec wire-to-wire, with the 95th percentile @ 8usec and the 99th percentile @ 10.25usec. But why is it multi-modal? The answer... the code evolved/improved over time, so if we plot the above and compare with the last 10 trading days we get:

Here you can see the median for the last 10 sessions is around 3.7usec (~6usec wire-to-wire), with the 95th percentile @ 5usec and the 99th percentile @ 6.75usec, and the multiple modes clearly due to different versions of the code. First version took a bit over a month to code up (feed+book+gw+strat+other), 2nd version a week or two (optimization), all running on standard hardware, nothing fancy.
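For anyone wanting to pull the same median / 95th / 99th numbers out of their own per-trade latencies, here is a minimal sketch. It uses the nearest-rank percentile definition, which is my choice and not necessarily how the plots above were produced, and the function names are hypothetical. The wire-to-wire helper just adds the ~1usec Rx + ~1usec Tx estimate from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 64-bit nanosecond latencies. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile; sorts lat_ns in place.
   Requires n >= 1 and 0 < p <= 100. */
static uint64_t percentile_ns(uint64_t *lat_ns, size_t n, double p) {
    qsort(lat_ns, n, sizeof lat_ns[0], cmp_u64);
    double exact = p / 100.0 * (double)n;
    size_t rank = (size_t)exact;
    if ((double)rank < exact) rank++;     /* round up to the next rank */
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return lat_ns[rank - 1];
}

/* Wire-to-wire estimate: register-to-register plus ~1usec Rx + ~1usec Tx. */
static uint64_t wire_to_wire_ns(uint64_t reg_to_reg_ns) {
    return reg_to_reg_ns + 1000 + 1000;
}
```

Calling `percentile_ns(samples, n, 50.0)`, then with `95.0` and `99.0`, gives the three figures quoted per plot. Running it separately per code version would be one way to confirm that each mode belongs to a different build.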

I could reduce this to 1 or 2usec wire-to-wire with an FPGA but it's completely pointless. Why? Because at this level of latency, it's irrelevant if you're half the latency of your competitors, as the latency jitter inside the exchange is orders of magnitude more than the delta between you and your competitors - remember this!

Net result? Technology and speed alone will not win you trades; this game is over / drawing to a close. I've seen this on multiple exchanges and it usually boils down to the exchange not having any clue at this level.

Maybe one day exchanges will wake up (or have their ass kicked) and hardware timestamp every packet at the BGP endpoint (exchange side switch), and use this wire hardware timestamp as the priority ordering into the matching engine... but that's a pipe dream / pure fantasy. Until then these kinds of ultra low latency trades are a battle of heavyweights slogging it out for 16 rounds with 16oz gloves + appropriate grease to lube the exchange.

What's next? It's clear to me higher strategy IQ is the way to go, thus my machine learning rant the other day. But I need to pay rent, so I'm available for tech contract work - see skillset

For contract work / interesting projects / collaborations / etc etc hit me up at hacking.nasdaq@gmail.com
