Thursday 31 December 2009

rock on intel

Ok, now that all the new year boozing is done, let's get back to it. First up, a new Intel NIC was put into Machine B, so we're now running:

Machine A (Intel 82573L) Xeon 2.6GHz
Machine B (Intel 82574L) i7 2.6GHz

Thus Machine A is a generation or two behind Machine B in both CPU and NIC, but that's what we have to work with.

First up are the send histograms, in both directions: A->B and B->A.


Machine A send



Machine B send

As you can see, Machine B kicks Machine A's ass to the tune of, say, 500-600ns. How much of this is NIC generation and how much is CPU generation is unclear; no doubt both contribute.

And on the other side, the recv histograms, first B->A and then A->B, look like (below):

Machine A recv


Machine B recv

And again we see Machine B kicking Machine A's ass - not quite a 2x improvement but not far off! In any case it looks far better than the Realtek histograms, which were... just abysmal, and we've now got a baseline to work with.

Next up is looking at the round trip latency, meaning A->B->A. Assuming the network switch doesn't suck (too much... it's old), we should expect around 1300ns (A send) + 500ns (B recv) + 750ns (B send) + 900ns (A recv), totaling around 3450ns round trip - quite a long time really.

Let's see what it clocks in as.
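Presumably this gets measured with a simple ping-pong: A timestamps, sends, B echoes the packet straight back, A timestamps again on receipt. A minimal sketch under that assumption (socket setup omitted, sizes illustrative, not the actual test code):

#include <stdint.h>
#include <time.h>
#include <sys/socket.h>

/* nanosecond timestamp off the monotonic clock */
static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* one A->B->A sample in ns - 'peer' is Machine B, which just echoes the payload */
static uint64_t round_trip_ns(int fd, const struct sockaddr* peer, socklen_t plen)
{
    char msg[64] = {0};
    uint64_t t0 = now_ns();
    sendto(fd, msg, sizeof(msg), 0, peer, plen);       /* A send             */
    recvfrom(fd, msg, sizeof(msg), 0, NULL, NULL);     /* A recv of B's echo */
    return now_ns() - t0;
}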

Tuesday 29 December 2009

sucky hardware+drivers

What's up with the bizarre latency? Turns out it's a combination of hardware + driver, most likely a shitty driver. Previous tests used Send(Intel 82573L) and Recv(Realtek RTL8111/8168B), so the logical question is: what happens if we swap send and receive?



First up is Send(Realtek) (above), which is still a little troubling, as it looks a bit like the Send(Intel) throttle @ 1 msg/10us in the previous post. Two spikes: one where you'd expect, and a second off out in the woods having a tea party or something. As we see this on 2 different machines, it looks suspiciously like something in the IP stack or NAPI... somewhere between the NIC and sendto().

Next is Recv(Intel) and...
 



...... tight, real tight. None of this bullshit spread we saw with Recv(Realtek). A very low number and a very tight spread - this is good.



And a nice closeup. As we can see, it's peaking around 900ns, 2400 cycles @ 2.66GHz (the recv machine's clock), which is pretty good considering this goes through a user program + socket user lib + network stack + driver. This however is just the time between packets; it's NOT the latency between machines, it's the latency from one message to the next on a single machine - a very different but related number.
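For the record, that inter-packet number is nothing fancy - just the delta between successive recvfrom() returns on the one box, binned into a histogram. A rough sketch, with the bin width and buffer size made up:

#include <stdint.h>
#include <time.h>
#include <sys/socket.h>

#define BIN_NS 50                      /* histogram bin width - illustrative */
#define NBINS  4096
static uint64_t gap_hist[NBINS];

static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* bin the gap between successive recvfrom() returns - message-to-message time
   on this machine, NOT machine-to-machine latency */
static void recv_gaps(int fd)
{
    char buf[2048];
    uint64_t prev = 0;
    while (recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL) > 0) {
        uint64_t t = now_ns();
        if (prev) {
            uint64_t bin = (t - prev) / BIN_NS;
            gap_hist[bin < NBINS ? bin : NBINS - 1]++;
        }
        prev = t;
    }
}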

Machine-to-machine latency requires both machines to send and receive, so that will have to wait until a less sucky network card + driver is put in the other box.

network buffers

The next interesting part is looking at the interconnect fabric, which in this case is 2x 1-Gigabit Ethernet NICs + a switch, using UDP at the protocol level. The test is conducted using a stock 2.6.30 kernel + socket library and standard Rx/Tx socket buffers. What we're testing here is the time it takes to send each message in its own UDP frame, so a sendto() on each ITCH message. The results are quite surprising (in a bad way).
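The harness itself is nothing exotic - timestamp around each sendto() and bin the result into a histogram. Roughly this (a sketch, not the actual test code; bin width and payload handling are made up):

#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <sys/socket.h>

#define BIN_NS 50                      /* histogram bin width - illustrative */
#define NBINS  4096
static uint64_t send_hist[NBINS];

static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* one UDP frame per ITCH message, timing each sendto() call */
static void timed_send(int fd, const struct sockaddr* dst, socklen_t dlen,
                       const void* msg, size_t len)
{
    uint64_t t0 = now_ns();
    sendto(fd, msg, len, 0, dst, dlen);
    uint64_t bin = (now_ns() - t0) / BIN_NS;
    send_hist[bin < NBINS ? bin : NBINS - 1]++;
}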



Here we see that send is well behaved, with an average around the 1200-1300ns mark and a fairly small standard deviation. That translates to about 3300 cycles @ 2.6GHz (the test machine's frequency), which is a lot... but considering how many libraries are involved and that there's a kernel call in there, not that unreasonable.

OTOH recv is pretty bad. If send runs at full rate, then recv drops a huge clusterfuck of packets, so we throttle send to 1 message every 10,000ns (10us), enabling recv to catch every packet.
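The throttle itself is just a busy-wait pacing successive sendto() calls 10,000ns apart - something like this (sketch; now_ns() as in the send-timing sketch above), called with interval_ns = 10000 before each timed_send():

#include <stdint.h>

/* spin until 'interval_ns' has elapsed since the previous send, then rearm -
   sleeping instead of spinning would blow the 10us budget */
static void pace(uint64_t* last_ns, uint64_t interval_ns)
{
    while (now_ns() - *last_ns < interval_ns)
        ;
    *last_ns = now_ns();
}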



Thus on the send side (above) we get the expected peak around the 10us mark, with a fairly small stdev, but what's up with that second spike? In contrast, the recv histogram is a mess.



There's some clear reciprocal send behavior here - the 2 spikes around 10us - but the spread is all over the map. Obviously something we really don't want when latency is so critical.

Throttling to 1 msg / 10us isn't a luxury we can afford, so what does it look like when send runs flat out?



Answer: quite interesting. The huge spike is centered around 500ns, 1335 cycles @ 2.67GHz (the test machine's frequency), which is promising; however the large flat chunk from 5000-10000ns would appear to be the problem - why is it taking so long? wtf? Looking at the network stack stats, the send machine drops 0 UDP packets while the recv machine drops a ton - on the order of about 40% of the message volume - so it's pretty certain this is an Rx problem.




Changing the histogram to graph messages dropped against time (above) doesn't help much. It's basically the same graph, a little flatter in parts but not very informative. Digging a bit deeper and plotting the number of packets dropped for each time bin (below), the graph makes more sense.



Green = sequential
Red = 1 missing packet
Blue = 2 missing packets
Orange = 3 missing packets
White = 4 or more missing packets.

So it's easy to see there's little correlation between recv time and the number of packets dropped, which... is not what I expected, i.e. that the longer recv takes, the higher the number of dropped packets. The one unknown factor here is UDP packet re-ordering, as by definition order isn't guaranteed. However, as the network topology is trivial (nic->router->nic), I find it hard to believe this would be a major factor. The next question is why this is occurring and what can be done about it.
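Roughly how the per-bin drop counting can be done, assuming each test packet carries an incrementing sequence number in its payload (a sketch, not the exact bookkeeping used here):

#include <stdint.h>

#define NBINS    4096
#define MAX_MISS 4

/* miss_hist[bin][k] = count of arrivals in this inter-arrival time bin that
   came with k packets missing since the previous arrival (k clamped at 4,
   matching the "4 or more" bucket above). Re-ordered packets would need
   extra handling - ignored here. */
static uint64_t miss_hist[NBINS][MAX_MISS + 1];

static void bin_arrival(uint64_t bin, uint64_t seq, uint64_t* last_seq)
{
    uint64_t missing = seq - *last_seq - 1;     /* 0 == sequential */
    if (missing > MAX_MISS)
        missing = MAX_MISS;
    miss_hist[bin < NBINS ? bin : NBINS - 1][missing]++;
    *last_seq = seq;
}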

Monday 28 December 2009

scratchin an itch

Next up is converting to ITCH4 to get rid of all those nasty ascii -> int conversions, and the results are quite surprising. Book update goes from 98 cycles -> around 60-61 cycles! So a full book update for MSFT now clocks in at around 15ns @ 4.1GHz. Things are starting to get interesting!
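For context on what went away: ITCH3.1 fields are ascii, so pulling something like the order reference number out of a message is a digit-by-digit ascii -> u64 loop, while ITCH4 fields are binary big-endian, so the same field is basically one byte-swapped load. Sketch (field widths/offsets here are placeholders, not the spec layout):

#include <stdint.h>

/* ITCH3.1 style: parse an N-digit ascii field into a u64 */
static uint64_t ascii_u64(const char* p, int digits)
{
    uint64_t v = 0;
    for (int i = 0; i < digits; i++)
        v = v * 10 + (uint64_t)(p[i] - '0');
    return v;
}

/* ITCH4 style: the field is already a big-endian binary integer */
static uint64_t be64(const uint8_t* p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}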

Next up is using the shitty SSE ISA to speed up searching the 8-entry set. The first attempt on order add resulted in a *slower* implementation, clocking in above the 15ns mark. Why so? Because 99% of order adds are placed at entry 0, i.e. the scalar code only does 1 loop iteration. It's not really an iteration either - gcc unrolls it completely - so we're touching way less memory (2B vs 16B), and Intel's scalar integer code runs damn fast, faster than the 9 or so SSE instructions. Thus scalar wins.

However... we already know whether the set is empty or not via the 2^20-entry bit array: if the bit is 0 the set is empty, 1 means non-empty (1-8 entries). Thus we can use this value to trivially accept an entry and not even read the 8x16b WayIDs - just write, woot! That, coupled with an SSE free-slot search, shaved about 6 cycles off the amortized processing time. Considering order add is only 1% of all message volume, it's a significant speedup, and we're down to 56 or so cycles, which is 13.7ns @ 4.1GHz to update the book.
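For the curious, the SSE free-slot/match search is the standard compare + movemask trick - one 128-bit load covers all 8x16b WayIDs, one compare, one movemask, one bit scan. A sketch (assuming a WayID of 0 marks a free slot, which is my simplification rather than the real layout):

#include <stdint.h>
#include <emmintrin.h>                  /* SSE2 */

/* 8-way set of 16-bit WayIDs; a value of 0 is assumed to mean "free slot" */
typedef struct { uint16_t way[8]; } set8_t;

/* index of the first way whose id matches 'key', or -1 if none */
static inline int set8_find(const set8_t* s, uint16_t key)
{
    __m128i ids  = _mm_loadu_si128((const __m128i*)s->way);
    __m128i k    = _mm_set1_epi16((short)key);
    int     mask = _mm_movemask_epi8(_mm_cmpeq_epi16(ids, k));
    return mask ? (__builtin_ctz(mask) >> 1) : -1;   /* 2 mask bits per 16b lane */
}

/* free-slot search is the same thing with key == 0 */
static inline int set8_free_slot(const set8_t* s) { return set8_find(s, 0); }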

Next is order del / order search - SSEing the order delete code - which unfortunately didn't improve overall performance much. The best result is 53.2 cycles, so about 12.9ns @ 4.1GHz. However at this level OS jitter is a significant problem, making consistent timing values difficult - Linux is a dog for twitchy, realtime problems... There's still plenty left to get that number down, mostly in the outer processing loop, but it's fast enough for now. The next interesting task will be looking at the full latency from the exchange, through the OS stack, the black box and back. Shall be interesting!

Sunday 27 December 2009

trivial reject

Ok, down to business. The problem is there's a 64b unique index (OrderID) we need to test against, so a straight-up array ain't going to cut it. The next choice is std::map<> with a key of OrderID storing a pointer to the order structure. Please note we are testing 700M messages against the MSFT symbol, so the reject rate is 99.7% of all messages. Also note this is ITCH3.1, so we need to do an ascii->u64 conversion on the OrderID, which is reflected in the perf numbers.

First up is std::map<>, which... fails to run. It allocates a metric ton of memory and barely gets beyond an hour or two of trading. Assuming there's enough memory, it clocks in at 553 cycles / message or 135ns @ 4.1GHz.

Next up is an unordered map, as we don't need it sorted, and... same thing: a metric ton of memory and it fails to complete. Assuming you have the memory, it clocks in at 285 cycles / message or 69ns @ 4.1GHz.

Which is, hmm, not bad if you have 64GB of RAM and you're a run-of-the-mill programmer, but we can do better. The trick is we have information that can't be expressed to the STL: OrderID only uses 34 bits of data, and the working set is roughly 18k orders, so let's exploit that using a bitmap array + 16-way set, similar to a hardware set-associative cache.


How it works: use the 20 LSBs of the OrderID as an index into a 1-bit array, resulting in a 128KB array (2^20 bits / 8 / 1024). If the bit is set then there are valid entries in the set, otherwise there's nothing there - trivial reject. If there are entries then it's a simple 16-entry search space, which is easily parallelizable.

Thus going from OrderID -> Order_t* is a nice fast O(1), or O(16) depending on how you look at it, with a nice fixed memory footprint. Problems arise if the set overflows and there's data loss, but the highest occupancy is a whopping 3 entries for MSFT, so in practice we drop down to 8 entries per set.
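Spelled out as code, the lookup is roughly the following - struct names, the split tag/pointer arrays and so on are my shorthand, not the actual implementation:

#include <stdint.h>

#define SET_BITS 20
#define SETS     (1u << SET_BITS)      /* 2^20 sets, indexed by the OrderID LSBs */
#define WAYS     16                    /* 16-way here; dropped to 8 below        */

typedef struct Order_t Order_t;

static uint8_t  occupied[SETS / 8];    /* 1 bit per set: 128KB                   */
static uint64_t way_id [SETS][WAYS];   /* OrderID tag per way                    */
static Order_t* way_ptr[SETS][WAYS];   /* payload pointer per way                */

static Order_t* lookup(uint64_t order_id)
{
    uint32_t set = (uint32_t)(order_id & (SETS - 1));        /* 20 LSBs */

    /* trivial reject: one bit test covers 99.7% of messages */
    if (!(occupied[set >> 3] & (1u << (set & 7))))
        return 0;

    for (int w = 0; w < WAYS; w++)                           /* O(16) scan */
        if (way_id[set][w] == order_id)
            return way_ptr[set][w];

    return 0;
}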

Total footprint = (2^20 / 8) + (2^20 * 16 * 8) ~= 128MB. However, as stated in optimization 101, keeping the footprint down is critical, and we can reduce this further. For each set entry we know its lower 20 bits are the same as every other entry in that set - they're the set index. OrderID has 34 significant bits, so that leaves 14 significant bits per tag, thus we can use a 16-bit value in the set search space instead of the full 64 bits, woot.

Footprint = (2^20 / 8) + (2^20 * 16 * 2) ~= 32MB

As the data indicates a max occupancy of 3 entries, it's safe to reduce to 8 entries per set, leaving

Footprint = (2^20 / 8) + (2^20 * 8 * 2) ~= 16.125MB


Which gets close to fitting entirely in the L3 cache! The numbers clock in averaging 108 cycles / message, meaning 26ns / msg @ 4.1GHz for a full book update on a single symbol. Keep in mind this is mostly just the reject code at play, it's straight C (no assembler/intrinsics), we're dealing with ITCH3.1, and the book update code is pretty shitty and eats into the average. So... there's still plenty of mileage to get that number down!

Saturday 26 December 2009

book update latency

The code for updating the book is quite simple; the interesting question is how fast it can be done, as it adds latency to the trading algo, and this space is all about latency. My initial thought is you want to update the book for every symbol for every order operation on a single machine, thus you need a throughput of around 100k msg / second, which translates to around 10,000ns or 41,000 cycles @ 4.1GHz per message... which is a huge chunk of time and no problem to process. What makes it interesting is optimizing for latency.



Why is latency so important? Because what matters is the response time from when the exchange sends out the order operation to the time the exchange receives your order request. So from the exchange, over Ethernet to your server's NIC, through the Linux network stack, into your black box trading algos, then back through the Linux networking stack, to the Ethernet card, over the wire and back to the exchange. Building the book is part of the "black box", thus the time it takes to update the book is important. How important? Not so sure at this point, but if your dumb-arse code is taking 10,000 nanoseconds just to update the book, you're going to get your ass kicked in the market.

How to minimize the book update latency? Pretty simple really - keep 1 symbol's book per CPU, or just a few books per CPU if you're, say, pair trading, and duplicate machines for however many symbols you're trading on. Why does it help? Because on, say, an Intel i7 processor you have 32KB x2 (instruction/data) L1 cache and 256KB L2 per core, plus a shared 8MB L3. When execution latency is paramount, the L1 cache is god, the L2 cache is close to god, the L3 cache is your priest and DDR is death. Translation: because the memory access patterns are fairly random, keep the working data set as small as possible, thus only process a few symbols on a single core/CPU socket so all the data stays on chip, in the cache.
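For what it's worth, one way to do the "1 symbol's book per CPU" part is simply pinning the book-update thread to its own core and leaving it there, e.g. (sketch; the core number is whatever your layout dictates):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* pin the calling thread to one core so its book stays hot in that core's caches */
static int pin_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}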

What are the numbers? An L1 miss is reportedly 10 cycles (2.44ns @ 4.1GHz), an L2 miss is 30-40 cycles (9.76ns @ 4.1GHz), and if you miss the L3 and go to DDR you're in the 100s of cycles, say 400 cycles (97.6ns @ 4.1GHz). However the Intel engineers have been hard at work making shitty code run fast on their hardware, so it's more complicated: data can be pre-fetched, hardware can schedule hit-under-miss, there are TLB miss costs, or god forbid... hitting the virtual page file. The moral of the story is to keep the memory footprint as small as possible and the access order as predictable as possible - this is x86 optimization 101.

So that's, err... cool, but how is it relevant? Simple. If your code is busy spending 1000ns processing the book for symbol X, and you're trading on symbol Y, then you've just pissed away 1000ns of latency while symbol Y's message is sitting in a buffer waiting to be processed! By that time your competitors already have an order on the wire en route to the exchange and you - FAIL. Thus the problem can be re-stated as "the fastest way to reject messages" - only process what you're interested in.

My first thought was: great, just do a 64b compare on the stock symbol for all messages and we're done, cool... too easy. However it's not that simple, as everything revolves around a unique OrderID, meaning you only have the symbol on the order add message; for all other order types you need to work backwards from the OrderID to decide if it's a relevant message - a nice fun problem.

the book


After an order has been distributed to the market, the next step is to build the book, which is simply the separation and sorting of orders into bins: buy/sell, price level, and order #. The result is a list of buy & sell price levels, with each price level having its own list containing the individual orders at that price. Simple stuff; what makes it interesting is the sheer volume and depth of the book. Maybe my code is buggy, but the book depth is about 1.5-2 orders of magnitude larger than what I was expecting. MSFT (2008/11) clocks in at on average 500 buy levels, 1500 sell levels and about 18K orders on the book at any one time. Certainly a different picture from the one on the left, which was my expectation.
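In code the book is roughly a pair of sorted price-level lists, each level carrying its own list of the individual orders at that price - something like this (a sketch; the names and the fixed-point price are my choices, not the actual structures):

#include <stdint.h>

typedef struct Order {
    uint64_t      order_id;
    uint32_t      shares;
    struct Order* next;          /* next order at the same price level */
} Order;

typedef struct PriceLevel {
    uint32_t           price;    /* price in cents                         */
    Order*             orders;   /* individual orders resting at this price */
    struct PriceLevel* next;     /* next price level, kept sorted           */
} PriceLevel;

typedef struct Book {
    PriceLevel* buy;             /* ~500 levels for MSFT (2008/11)  */
    PriceLevel* sell;            /* ~1500 levels for MSFT (2008/11) */
} Book;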


So what's up with that? I think the answer is that it only shows quotes near the last executed price, or some moving average + standard deviation, and thus the displayed order book is nice and small. What does it really look like? Here's the buy order book for MSFT sampled every 10ms - rather beautiful.


buy book sampled every 10ms, y axis $0-$30

Note that the coloring is based on order count at that price level at that time - it's not share volume at the price level. Pretty, eh? A bit like geology. You can see all the dumb-arse (near) $0 orders, the various "oh shit sell" price levels (long horizontal lines) and the edge - where all the action is.

A more detailed version shows from $18 (bottom left-hand corner) on up. Each vertical pixel == 1 cent. Not sure what to make of the horizontal black lines; I assume it's gaps in the price levels or a floating- vs fixed-point calc thing... though at these levels it shouldn't be a problem - or it's just a bug in my code.



from $18 1 vert pixel = $0.01, horz pix = 10ms

And... it wouldn't be complete without a sell picture; here's the same $0-$30 scale. Anything > $30 is capped at $30. Depending on how you look at it, it might be a cave, or a mountain at sunset. It's interesting that the band of higher prices (long horizontal lines) is noticeably denser than that of the buy book - guess this is where most of the 500 vs 1500 price level difference goes.


sell book vert scale[$0,$30] 1 horz pixel = 10ms

And an obligatory close up on the edge - fascinating stuff.



sell book vert scale from $18 1 vert pixel = $0.01, horz pixel = 10ms

Friday 25 December 2009

interesting activity

The last post was randomly selected snippets of the market, which were sometimes interesting, sometimes not. I was curious about what's going on at the highest order rate of the day, which turns out to be... just after the first 30 seconds of market open. Around 10k (mostly add) orders are put on the books in a very interesting pattern.



As you can see there's this crescendo of orders which suddenly flatlines. It's quite bizarre: the # of orders in every 1ms slice never exceeds 100 - I double-checked the data as it's a highly artificial pattern. So what's going on here? My guess, as it's almost entirely add orders, is that this is an artifact of the exchange's software, e.g. at exchange open time all participants send their orders at the same time, which get queued at the exchange and processed at a rate-limited speed of 100 orders / 1ms, or 100k / sec (presumably per symbol). Once everyone's initial orders are placed, everything settles down and it's business as usual. What's the takeaway? Probably that if you flood the exchange with > 100 orders / 1ms it's going to get severely clogged up, so maybe you can slow down your competitors' orders.

The second interesting point is a bit of an anti-climax: the portion of the session with the highest number of executed orders, which turns out to be just after the 6-hour mark.


1 horz pixel = 1ms

Nothing too interesting here: 2 huge spikes, followed by a chunk of activity. Remember, this is # of orders executed, not share volume executed. So it's a lot of activity for a 1ms slice - maybe some big dumb algo snapping up a chunk of the book, followed by watcher algos thinking something bigger is going to happen? But... just guesses at this point.


a symbol

The next step is to look at the traffic on a single symbol. Here are the same 1sec, 10ms, 1ms charts for MSFT. There's some signal in there somewhere, but the signal-to-noise ratio is going way down. Processing-wise, say a max of 1000 orders / sec means a whopping 1000us / message... which is a hell of a lot of time.

1 horz pixel = 1 sec


1 horz pixel == 1 sec

Breaking it into 10ms bins, it gets increasingly chaotic such that there's little if any sort of pattern. Probably the only clear thing is that the spikes are 10x the average, which I'm guessing is the hi-freq guys pouring orders into the market. Alternatively it might just be large trade activity broken into randomly sliced bins, with bursting orders, thus the large spikes... In any case I need to break it out into price levels + volume to work out what's actually happening on these order spikes.


1 horz pixel == 10ms




1 horz == 10ms


Drilling down to the 1ms level and looking at the number of orders issued, it's hard to find much signal - it's basically noise.


1 horz pixel = 1ms


1 horz pixel = 1ms

birds eye view


So what kind of message rate is this? For 700M msg / day it's not that interesting, peaking at around 100k messages / second, which translates to around 10,000ns / message or 41,000 cycles @ 4.1GHz... which is *a lot*. Here's the plot of a single day's order add/del/execute/cancel for all symbols. What's interesting is how spiky it is, with some very clear patterns - presumably "time bins" from VWAP / TWAP algos.

Green  = Number of Order Deletes
Blue   = Number of Order Adds
Orange = Number of Order Executes
Red    = Number of Order Cancels


1 horz pixel = 1sec of orders

Obviously quote data is an order of magnitude larger than executed trades, so breaking out just the order execution data looks something like the below (note: the vertical scale is an order of magnitude less), which at a glance doesn't have the same cyclic time bins.



1 horz pixel = 1sec of executed orders

Drilling down to get a more micro-level view, the next shot is sampled every 10ms, i.e. every x pixel is a 10ms slice, and runs for about 11 minutes. What occurs around the 20,000ms mark is quite interesting, as it appears some algo gets triggered and a ton of orders flood the market. Keep in mind these are graphs of all symbols, with the vertical axis measured in message count, NOT aggregate order volume. What's fascinating is that (I assume) the spikes before 20,000ms are one algo/symbol/firm(s) and the spikes after are some other algo/symbol/firm(s) - in essence a digital signature.



1 horz pixel = 10ms worth of orders

Drilling down further to the 1ms level, it gets more chaotic and extracting signal from the noise gets harder. This chart shows each horizontal pixel as 1ms worth of orders, which is the resolution limit of the ITCH3.1 spec. ITCH4 timestamps are at the nanosecond level, so maybe the same picture gets clearer with the increased accuracy... or maybe not. The spiky patterns are still fairly visible, yet what they represent could be VWAP/TWAP or might even be how the exchange processes orders. Probably the former, but nonetheless it's a clear signature of some sort.



1 horz pixel = 1ms worth of orders

It's kind of weird, it's like deja vu looking at these graphs, as it's similar to the work I've done with hardware performance trace logic - think a cycle-level logic analyzer which dumps out counters every 100ns or so into dedicated DDR/RAM. You end up with this massive chunk of data that needs analysis to work out wtf is going on and how to improve chip performance - fun stuff.

the feed


Looks like there's a variety of data feeds for receiving quote and tick data for trading US equities, with the primary feeds being:


For those curious, the exchange provides free sample data.


They provide full historical data, but it costs in the region of $1000 USD/month for 1 user. The question is: which feed to choose?

The ArcaBook multicast data is just shockingly bad. It's a raw tcpdump, which is great, but it seems the exchange people kept the UDP socket buffer at its default 64KB size... so when the traffic bursts it drops an awfully large number of packets. As the ArcaBook protocol has a sequentially increasing sequence number (yay), it's easy to calculate that the sample data drops 8.2% of the packets! That translates to 726K packets (not messages) dropped out of around 8.8M total packets... how can you verify bookkeeping with this? Yes, the ArcaBook TCP dump should contain everything; however dropping that many packets on a low, < 100KB/s average stream doesn't inspire confidence.

OTOH the ITCH3.1 sample data weighs in at around 17GB raw and about 700M msgs for a single trading day (2008/11) - volume levels that are actually interesting! Also, major points for reporting trades on the same feed - Arca is pure quote data. The spec doesn't provide a sequentially increasing number to check against, but it's TCP data so in theory it's complete, and the books add up and the volume is 2 orders of magnitude more. So... guess that's where the action is!

the seduction


Wow, first blog... first post... this is awkward. I've been following high-frequency trading tech for a while now. It's one of those things which grows and grows on you from a purely technical point of view, like one of those mosquito bug zappers which hypnotically attract insects. Latency problems, throughput problems, hardware architecture, crazy-ass server infrastructure, cycle counting, bit f00king and so on are all finally getting interesting. And my current excursion into entrepreneurship has led to the obvious, but from a tech point of view depressing, result: entrepreneurship is about solving people's problems, NOT technical problems.

Which is why high-freq/algo trading is interesting: from outward appearances it's yourself or a small group going out into the blood bath that is the market to brawl with your foes' computer systems, where the only thing that matters is the PnL. No marketing, no sales, no clients, no support, no investors and not really a product either - it's all about whether your tech is faster and smarter than the competition. It's in stark contrast to entrepreneurship, which is about going out into the market to discover people's problems, building a solution to those problems, getting people to use your solution and then getting people to pay. Not to mention cultivating a network of relationships, building a team and so on.

Trading looks like finding a weakness in the market, building a strategy/algo to exploit it, then ruthlessly executing for as long as possible. So as an introvert, hacker and tech-centric person, I'm being seduced by the notion that trading is focused on tech, numbers and bits. Thus this blog is about testing that hypothesis and seeing where it all goes.