Thursday 12 April 2012

Under the hood of a Black Box

Building mid-freq strategies is a lot more involved than HF/UHF strategies, at least so far. With HF/UHF you're looking for a simple pattern with a simple data transform that's consistent enough to build a strategy around. With mid-freq strategies it's still a simple pattern; the difference is that the abstraction level and the greeks of the trade are more complicated - at least that's how it seems. e.g. no longer looking at the very short term micro inventory/information/latency arbitrage opportunities, and instead at stationary patterns in highly abstracted and transformed data sets.

My mid-freq strategies are not going well. It takes a lot of time, and I put a hard deadline of end-of-the-month to have something... which gets closer each day. Thus it's almost certain I'll be pulling the machine from colo to regroup, lick my wounds, get some sleep and then march forward with a modified approach... just can't do this alone starting from 0.


So less ranting about my (lack of) PnL and more on the tech side. Here's a logical block diagram of my system.


Pretty simple eh?

First up, the NICs:

NIC0 -> ssh / management interface
NIC1 -> Order Entry
NIC2 -> not used
NIC3 -> Market Data

Next up, the HWTs (hardware threads).

Conventional wisdom says you should disable hyper-threading as it has a significant impact on the performance of the other HWT on the core, which can be true. Hyper-threading works by having separate "contexts" - e.g. a register block & program counter - in hardware, while sharing the same execution pipeline. It's similar in concept to a time-sliced operating system, where each process/thread has its own copy of all user-land registers which are swapped in at the start of the thread's time slice; the thread executes for a set period of time, the registers are copied back into memory, and a new thread is swapped in. This allows multiple programs to share a single CPU and maximizes the utilization of the CPU. HW threads work in a similar way but at the ISA (instruction set architecture) level, on top of a processor's microarchitecture.

The theory for both the time-sliced OS and hyper-threading is that, for a significant % of the time, the execution units are idle because the program is blocked waiting for IO. Thus some other program can utilize the HW resources while the blocked IO completes, and you get higher execution occupancy & more throughput... but at the cost of increased latency.

OS Example:

while a thread is blocked waiting for Keyboard/Disk/Network input, some other thread runs

HW Example:

a memory read misses L1/L2/L3 and has to be fetched from DDR (~100 cycles); some other program runs for those cycles.

Have been coding for 8-core asynchronous systems since 2001, so designing for wide processing is quite natural these days - have already suffered that transition pain. Thus I have plenty of processes/threads but not enough cores, so I have to eat the latency cost and enable hyper-threading.

Short description of each HWT. All processes/threads are locked to their respective HWT.
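
Locking a process to a HWT is just the stock Linux affinity call; a minimal sketch (the HWT number is whichever one the process belongs on, per the list below):

/* Minimal sketch: lock the calling process to a single HWT.
 * Stock Linux affinity API, nothing exotic. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void lock_to_hwt(int hwt)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(hwt, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
        perror("sched_setaffinity");
        exit(1);
    }
}

(The same thing from the shell is just taskset -c N ./prog.)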

HWT0:

This is the general-purpose HWT; everything runs on this. To Linux the system looks like a 1-HWT machine: bash, sshd, etc. etc.

HWT1:

Nothing pre-defined; what gets assigned here depends on what I'm doing with the machine, e.g. live strategies, back testing, backup/crunching.

HWT2:

FIFO scheduled (i.e. not time sliced) for all strategies to run on. Processing is set up so one cycle of a strategy runs, then it round-robins (via the Linux scheduler) to the next strategy, and so on. Can be dangerous as the strategies can affect each other, but the core strategy logic is usually very simple and light.
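
A minimal sketch of that mechanism - my reading of it, not the actual strategy code - where run_one_strategy_cycle() is a hypothetical stand-in for the real logic:

/* Sketch: every strategy process pins itself to HWT2, switches to SCHED_FIFO
 * at the same priority, runs one cycle, then calls sched_yield() so the
 * kernel rotates to the next SCHED_FIFO task at that priority. No time
 * slicing; cooperative round-robin. Needs CAP_SYS_NICE / root. */
#include <sched.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };   /* arbitrary RT priority */
    sched_setscheduler(0, SCHED_FIFO, &sp);             /* 0 = this process */

    for (;;) {
        /* run_one_strategy_cycle();   hypothetical: one pass of the strategy */
        sched_yield();                 /* hand HWT2 to the next strategy */
    }
}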

HWT3:

For HF/UHF the amount of brute-force number crunching is not that high, thus a single HWT is sufficient. The thread has a job queue to which anyone can submit something to be crunched.
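
Something in the spirit of that job queue - purely my illustration, a toy single-producer/single-consumer ring (a real multi-producer version needs a CAS on the head index):

/* Toy job queue: one producer pushes work descriptors, the crunch thread
 * (pinned to HWT3) spins, pops and executes them. */
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 1024u                     /* power of two */

typedef struct {
    void (*fn)(void *arg);              /* what to crunch */
    void *arg;
} job_t;

static job_t       slots[QSIZE];
static atomic_uint head;                /* next slot to write (producer) */
static atomic_uint tail;                /* next slot to read (crunch thread) */

static bool submit(void (*fn)(void *), void *arg)   /* single producer */
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == QSIZE)
        return false;                   /* full; caller decides what to do */
    slots[h & (QSIZE - 1)] = (job_t){ fn, arg };
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}

static void crunch_loop(void)           /* runs forever on HWT3 */
{
    for (;;) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        if (t == atomic_load_explicit(&head, memory_order_acquire)) {
            asm volatile("pause" ::: "memory");   /* idle politely on the HT sibling */
            continue;
        }
        job_t j = slots[t & (QSIZE - 1)];
        atomic_store_explicit(&tail, t + 1, memory_order_release);
        j.fn(j.arg);
    }
}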

HWT4:

Market Data Feed handler. You might ask why only 1 HWT for this? The first answer is that the more queues you add, the higher the latency; my system has only one queue, and that's the socket's Rx buffer, which is massive. The second answer is that I'm not keeping a book for all ~6.5K symbols on NASDAQ, thus don't need the additional throughput. As mentioned way back in 2010, the key here is extremely fast trivial rejects to filter out all the crap you don't need.
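
For flavor, a trivial reject looks something like this - note the message layout here is a made-up ITCH-ish format purely for illustration, not the real feed:

/* Toy trivial reject: look at the cheapest fields first and bail out
 * before doing any real parsing work. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOCATE 8192                            /* plenty for ~6.5K symbols */
static uint8_t tracked[MAX_LOCATE];                /* 1 = symbol we actually care about */

static bool trivial_reject(const uint8_t *msg, size_t len)
{
    if (len < 3)
        return true;                               /* runt, drop */

    switch (msg[0]) {                              /* message type byte */
    case 'A': case 'E': case 'X': case 'D': case 'U':
        break;                                     /* add/exec/cancel/delete/replace */
    default:
        return true;                               /* everything else: drop immediately */
    }

    uint16_t locate = (uint16_t)((msg[1] << 8) | msg[2]);   /* assumed symbol-locate field */
    return locate >= MAX_LOCATE || !tracked[locate];        /* drop symbols we don't track */
}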

HWT5:

This is the disk IO core, whose sole purpose in life is to copy blocks of shared memory to the SSD.
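
Roughly what that amounts to - the shared-memory name, output path, sizes and the block hand-off here are all invented for the sketch:

/* Sketch: hot threads fill blocks in a shared-memory region; this core
 * does nothing but stream filled blocks out to a file on the SSD. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLOCK_SIZE (1u << 20)                       /* 1 MiB per block, arbitrary */
#define NUM_BLOCKS 64

int main(void)
{
    int shm = shm_open("/hwt5_blocks", O_RDWR, 0);  /* region created by the producers */
    uint8_t *base = mmap(NULL, (size_t)BLOCK_SIZE * NUM_BLOCKS,
                         PROT_READ, MAP_SHARED, shm, 0);

    int out = open("/data/capture.bin", O_WRONLY | O_CREAT | O_APPEND, 0644);

    for (unsigned next = 0;; next = (next + 1) % NUM_BLOCKS) {
        /* block_ready()/mark_free() are hypothetical flags the producers
         * would keep in the shared region; spin until the next block is full. */
        /* while (!block_ready(next)) ; */
        write(out, base + (size_t)next * BLOCK_SIZE, BLOCK_SIZE);
        /* mark_free(next); */
    }
}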

HWT6:

The Gateway + Active Risk checks. This translates internal order requests (new/mod/can/exe) into their native protocol versions and performs basic risk checks / position management / fat finger checks before sending them into the market. The Gateway, or OMS as some call it, has hooks to external programs which can enable/disable the sending of orders. The risk checks are minimal as it's on the critical latency path, thus the more elaborate checks are done passively post-trade.
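
To make "minimal" concrete, the active checks are on the order of this toy version (the fields and limits are invented for illustration):

/* Toy pre-trade checks: a handful of cheap comparisons on the critical
 * path; anything more elaborate happens passively post-trade. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t symbol_id;
    int32_t  qty;               /* signed: + buy, - sell */
    int64_t  price;             /* fixed point, e.g. 1/10000 of a dollar */
} order_t;

typedef struct {
    int32_t max_order_qty;      /* fat finger: single-order size cap */
    int64_t max_notional;       /* fat finger: single-order value cap */
    int32_t max_position;       /* running per-symbol position cap */
    bool    sending_enabled;    /* the external enable/disable hook */
} limits_t;

static bool pre_trade_ok(const order_t *o, const limits_t *lim, int32_t cur_pos)
{
    if (!lim->sending_enabled)                          return false;
    int32_t aqty = o->qty < 0 ? -o->qty : o->qty;
    if (aqty > lim->max_order_qty)                      return false;
    if ((int64_t)aqty * o->price > lim->max_notional)   return false;
    if (cur_pos + o->qty >  lim->max_position)          return false;
    if (cur_pos + o->qty < -lim->max_position)          return false;
    return true;
}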

HWT7:

Networking utils / passive risk checking. Part one of this HWT captures and logs everything on the wire, in all directions, going everywhere - think NSA-style layer-2 snooping (yes, I see you knocking on the door, 192.168.42.1). The other part digests and analyzes the captured data in pseudo-realtime. There are soft latency limits here; ideally all of these functions would be running on an independent machine, but... didn't want to spend the cash for that.
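
One way to do that layer-2 capture - assuming a plain AF_PACKET socket, which may or may not be what this box actually uses; the interface name and log hand-off are placeholders:

/* Sketch: a raw layer-2 socket sees every frame, both directions, on the
 * bound interface. Needs CAP_NET_RAW / root. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));   /* every protocol */

    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex("eth3");     /* placeholder: e.g. the market data NIC */
    bind(s, (struct sockaddr *)&sll, sizeof(sll));

    unsigned char frame[2048];
    for (;;) {
        ssize_t n = recv(s, frame, sizeof(frame), 0);
        if (n > 0) {
            /* log_frame(frame, n);   hypothetical: stamp it and hand it to
             * the shared-memory blocks that HWT5 writes out */
        }
    }
}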


.. and so the quest continues, as digging through terabytes of data, racing a 500HP golf cart on the screaming edge of technology down some sketchy back alleyway in Hong Kong... is so my thing :P

5 comments:

  1. This comment has been removed by the author.

  2. The latency cost is huge if you use HT. If I use one thread (spin-loop) per hardware core, then I get results like these:

    4 cores, 1PUB->3SUB:
    -----------------------------------
    94718018.1454465 ops/secs
    94116465.9894464 ops/secs
    84562782.2282857 ops/secs
    80818355.8960223 ops/secs
    91871795.9711155 ops/secs

    but with HT, when I have more active threads than hardware cores (4 cores / 8 logical CPUs), the speed drops:

    8 logical CPUs, 1PUB->7SUB:
    -----------------------------------
    22610550.8970284 ops/secs
    23739017.0448516 ops/secs
    24338574.8888214 ops/secs
    23703615.6784247 ops/secs
    23613373.5128002 ops/secs

    You should also pin each thread to a core - otherwise a thread (and its cache working set) can jump between cores, which strongly increases latency.

    A second thread with HT can be blocked by an active thread. You can use this asm to limit such situations:

    asm volatile("pause\n":::"memory");

    regards,
    daniel

  3. Thanks for the comment + stats. Yes HT reduces performance, and yes oversubscribing threads to cores will also reduce performance. And yes, busy loops are the worst-case test for HT shiftiness - no IO. I'm certainly not going to defend the suckiness of HT.

    My rationale is, on a limited dollar budget I choose to take the hit of HT over
    1) a time-sliced kernel scheduler
    2) using the kernel to schedule via blocking events
    3) a one-huge-bad-ass-program approach that consumes/internally schedules everything.

    My comments on each are:

    1) costs at least 50us on your stock Linux kernel.
    2) is brittle and less deterministic, as one process/thread can severely impact the latency of everything.
    3) my systems are a collection of independent programs, aka "the Unix way".

    Wasn't aware of the pause instruction, thanks!

  4. Do you want to completely give up HF/UHF and go to mid-freq? But what do you mean by "mid-freq"? 5 sec, 5 min??

  5. For my setup UHF/HF scalping doesn't work due to latency, commissions or both. So yes, no longer looking at these kinds of strategies... sucks major ass.

    Everyone has their own definition of high/mid/low freq. For me it's: any trade where manipulation of your order in the queue has little impact on the PnL.

