Monday 28 December 2009

scratchin an itch

Next up is converting to ITCH4 to get rid of all those nasty ASCII -> int conversions, and the results are quite surprising. Book update goes from 98 cycles to around 60-61 cycles! So a full book update for MSFT now clocks in at around 15ns @ 4.1GHz. Things are starting to get interesting!
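
To make the saving concrete, here is a minimal sketch of the two decode paths, assuming the older feed carries space-padded ASCII decimal fields and ITCH4 carries fixed-width big-endian binary integers. The function names `parse_ascii_u32` and `read_be_u32` are hypothetical, not the actual handler code.

```c
#include <stddef.h>
#include <stdint.h>

/* ASCII field: one multiply-add (plus a branch) per character. */
static uint32_t parse_ascii_u32(const char *p, size_t len)
{
    uint32_t v = 0;
    for (size_t i = 0; i < len; i++) {
        if (p[i] == ' ')
            continue;                       /* skip space padding */
        v = v * 10 + (uint32_t)(p[i] - '0');
    }
    return v;
}

/* Binary field: four byte loads and shifts, which compilers
 * typically reduce to a single load plus a byte swap. */
static uint32_t read_be_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

The ASCII path does work proportional to the field width for every price, size and order reference, while the binary path is constant work per field, which is consistent with the large drop in cycles.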

Next up is using the shitty SSE ISA to speed up searching the 8-entry set. The first attempt, on order add, resulted in a *slower* implementation, clocking in above the 15ns mark. Why so? Because 99% of order adds are placed at entry 0, i.e. the scalar code only does 1 loop iteration. It's not really an iteration either: gcc unrolls the loop completely, so we're touching way less memory (2B vs 16B), and Intel's scalar integer code runs damn fast, faster than the 9 or so SSE instructions. Thus scalar wins.
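
A rough sketch of the two search variants, assuming each set is 8 x 16-bit way IDs stored contiguously (16 bytes). `find_way_scalar` and `find_way_sse` are hypothetical names, and `__builtin_ctz` is a gcc/clang builtin.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

#define WAYS 8

/* Scalar search: exits on the first match, so a hit at entry 0
 * reads only 2 bytes of the set. */
static int find_way_scalar(const uint16_t way_id[WAYS], uint16_t key)
{
    for (int i = 0; i < WAYS; i++)
        if (way_id[i] == key)
            return i;
    return -1;
}

/* SSE2 search: always loads all 16 bytes and compares 8 lanes at once. */
static int find_way_sse(const uint16_t way_id[WAYS], uint16_t key)
{
    __m128i ids  = _mm_loadu_si128((const __m128i *)way_id);
    __m128i eq   = _mm_cmpeq_epi16(ids, _mm_set1_epi16((short)key));
    int     mask = _mm_movemask_epi8(eq);   /* 2 mask bits per 16-bit lane */
    if (mask == 0)
        return -1;
    return __builtin_ctz((unsigned)mask) >> 1;  /* lowest matching lane */
}
```

The SSE version costs the same whether the match is in lane 0 or lane 7, so it only pays off when matches are spread across the set; when nearly every hit is at entry 0, the early-exit scalar loop has less to do.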

However... we already know whether the set is empty or not, via the 20b bit array. If the bit is 0 the set is empty, and 1 means non-empty (1-8 entries). Thus we can use this value to trivially accept an entry and not even read the 8x16b WayIDs - just write, woot! With that coupled with an SSE free-slot search, we shaved about 6 cycles off the amortized processing time. Considering order add is only 1% of all message volume, it's a significant speedup, and we're down to 56 or so cycles, which is 13.7ns @ 4.1GHz to update the book.
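
The add path described above could look roughly like this. It is a sketch under several assumptions: the bit array holds one bit per set and is indexed by a 20-bit set number, a way ID of 0 marks a free slot, and deletes zero the slot they free so an empty set contains only zeros. All names (`order_set_t`, `occupied`, `order_add`, ...) are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

#define WAYS     8
#define SET_BITS 20
#define NUM_SETS (1u << SET_BITS)

typedef struct { uint16_t way_id[WAYS]; } order_set_t;

/* One occupancy bit per set: 0 = empty, 1 = holds 1-8 entries. */
static uint32_t occupied[NUM_SETS / 32];

static int set_is_occupied(uint32_t set)
{
    return (int)((occupied[set >> 5] >> (set & 31)) & 1u);
}

/* SSE2 free-slot search: compare all 8 way IDs against 0 in one go. */
static int find_free_way_sse(const uint16_t way_id[WAYS])
{
    __m128i ids  = _mm_loadu_si128((const __m128i *)way_id);
    int     mask = _mm_movemask_epi8(_mm_cmpeq_epi16(ids, _mm_setzero_si128()));
    return mask ? (__builtin_ctz((unsigned)mask) >> 1) : -1;
}

/* Returns the way used, or -1 if the set is full. */
static int order_add(order_set_t *sets, uint32_t set, uint16_t id)
{
    order_set_t *s = &sets[set];
    if (!set_is_occupied(set)) {
        /* Trivial accept: the set is known empty, so write way 0
         * without ever reading the way IDs. */
        occupied[set >> 5] |= 1u << (set & 31);
        s->way_id[0] = id;
        return 0;
    }
    int way = find_free_way_sse(s->way_id);
    if (way >= 0)
        s->way_id[way] = id;
    return way;
}
```

The common case touches one bit of the occupancy array plus a single 2-byte store, and the 16-byte load is only paid when the set already has entries.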

Next is order del / order search: SSEing the order delete code. Unfortunately that didn't improve overall performance much. Best result is 53.2 cycles, so about 13.0ns @ 4.1GHz. However, at this level OS jitter is a significant problem, making consistent timing values difficult - Linux is a dog for twitch, realtime problems... There's still plenty to do to get that number down, mostly in the outer processing loop, but it's fast enough for now. The next interesting task will be looking at the full latency from the exchange, through the OS stack, black box and back. Shall be interesting!
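
For reference, the cycle-to-nanosecond figures quoted in this post are just cycles divided by the clock rate in GHz. A tiny helper, with the hypothetical name `cycles_to_ns`, makes the arithmetic explicit:

```c
/* A clock running at `ghz` GHz completes `ghz` cycles per nanosecond,
 * so elapsed nanoseconds = cycles / ghz. */
static double cycles_to_ns(double cycles, double ghz)
{
    return cycles / ghz;
}
```

For example, `cycles_to_ns(56.0, 4.1)` gives roughly 13.7, matching the order add figure above.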
