This Blog is Systematic: Systems building

Monday 15 June 2015

Systems building - deciding positions

This is the third post in a series giving pointers on the nuts and bolts of building systematic trading systems.

A common myth is that the most important part of a systematic trading system is the 'algo': the procedure, or set of rules, that essentially says 'given this data, what position do I want to hold, or what trade do I want to do?'

To some extent this is true. Any 'secret sauce' that you have will be in the 'algo'. All the rest of the code is 'just engineering'. There is no alpha from having a robust database system (though there might be from optimal futures rolling).

However the 'just engineering' is roughly 95% of my code base. And although there is no alpha from engineering, the consequences of screwing it up are enormous - there's an awful lot of potential negative alpha from say wrongly thinking your position in BP.LSE is zero, when you want it to be 100 shares, and then repeatedly buying 100 shares over and over again because you still think you don't have any shares at all, and who knows you might end up famous.

Because things always go wrong we do need to have the 'engineering': checks and balances when data isn't perfect.

A generic algo

Because people's ideas of what constitutes a trading 'algo' vary so widely, here is a generic function which shows some of the interesting inputs we could have in such a thing. You might want to put sunglasses on now, as there's some hideous colour coding going on here.

Optimal_position_instrument_x_t =
   f ( price_x_t, price_x_t-1, ...,
       bid_price_quant_L1_x_t-3, ..., offer_price_quant_L2_x_t-7, ...,
       some_data_x_t, some_data_x_t-1,
       other_data_x_t-1, ...,
       price_y_t, price_y_t-1, ..., some_data_y_t, ...,
       price_z_t-1, ..., some_data_z_t-2, ...,
       position_x_t-1,
       a_parameter, another_parameter, yet_another_parameter, ...)

Interpreting this into English, the kinds of inputs into an 'algo' to work out the best position for instrument X could include any of the following things:

1. The current price of instrument X
2. A history of previous prices of X
3. The state of the order book of X
4. The last piece of 'some data' for X and optionally a history of 'some data' for X.
5. 'Other data' for X for which the last observation hasn't yet been received.
6. The current price of instrument Y (and perhaps its history), and some data for Y (and again perhaps history).
7. Some prices and data for other instruments like Z, for which we don't yet have a current price or even the one before that
8. The last position for X
9. Some parameters

... and sunglasses off (unless you always wear them indoors).

First point to note here is that we can have a single data series (most commonly price), or multiple data series.

Thing 1 and thing 2 would be used for Technical systems. I'm ignoring the question of 'what is price' (mid of inside spread, last trade, last close, ...) which I've briefly addressed here.

Market data depth, beloved of high frequency traders and those using execution algos, is the stuff of thing 3.
Fundamental systems would have things of types 4 and 5 [See here to understand the difference].

Things of category 6 and 7 are obviously used for cross sectional trading systems. Think of a single country intra industry long/short equity strategy. You would never decide your position in GSK.L without also knowing all the data relating to AZN.L.

However there may also be cross sectional information in other systems. So for example my positions depend on the risk of my overall portfolio (another post to come...). That in turn depends on correlations. Which means every position depends, to a degree, on the price history of everything that's in my portfolio.

Thing 8 would happen in two cases. The first is when you have a state dependent algo. For example something with separate entry and exit rules. It would behave differently depending on whether you currently have no position, or are long or short.

This isn't my thing, by the way (if you'll excuse the pun). State dependence makes back testing difficult and unpredictable (small changes in the start date of your data series can change your position history and thus profitability) and slow (you can't use efficient loops as easily or separate calculations on to cluster nodes), and it makes live trading erratic and unintuitive (at least to my mind).

The second case is when you have to pay trading costs. If you have to pay to trade then it's only worth trading if your optimal position is sufficiently different from your current one. 'Sufficiently different' is something entire PhD theses have been written on, but again will be the subject of another post at some point.
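To make the trading cost case concrete, here is a minimal sketch in Python. It assumes the crudest possible rule, a fixed threshold; the function name and the `min_trade` parameter are made up for illustration, and choosing a sensible threshold is exactly the hard part that isn't attempted here.

```python
def trade_required(optimal_position: float, current_position: float,
                   min_trade: float) -> float:
    """Return the trade to do now, or 0.0 if the gap is too small to be worth the cost."""
    gap = optimal_position - current_position
    if abs(gap) < min_trade:
        return 0.0  # not 'sufficiently different': stay where we are
    return gap


small_gap = trade_required(optimal_position=105, current_position=100, min_trade=10)
large_gap = trade_required(optimal_position=120, current_position=100, min_trade=10)
```

Note that the answer depends on the current position, which is why thing 8 has to be an input.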

Thing 9 would be common to all trading systems, and is hopefully uncomplicated.

What do they have in common?

Now a close perusal of things of category 1 to 7 reveals something they have in common. Unlike in an ideal world we don't always know the current value of an input 'thing'; where thing could be the price of X, part or all of the order book for X, some other data for X, the price of another instrument Z, or some other data for Z.

We can assume that we'll always know our position in X - if not we shouldn't be trading - and that barring catastrophic disk failure our parameters will always be available.

So for example, if I'm using the entire L2 order book then it's unlikely that all the bids and offers in the system were entered at exactly the same time. It's unlikely I'll have up to date earnings forecasts in a fundamental equity system; such forecasts are updated sporadically over the course of several months and certainly not minute by minute. Few things, except explicit spread trades, trade exactly at the same time. Similarly fundamental data for different countries or instruments is rarely released in one moment.

Imagine the fun we'd have if US non farm payrolls, GDP, interest rate decisions and the same for every other country in the world were all released at the same time every month. Do you think that would be a good idea?

I guess he's not so keen.

This is the general problem of non-current data. If you have more than one thing in your 'algo' inputs, it will also be a problem of non-concurrent data; for example I might have the current price of X, but not a concurrent price of Y.

By construction this also means that in the history of your things there will probably be gaps. These gaps could be natural (the gaps in traded price series when no trades happen) or caused by up-sampling (if I'm using daily data then there will appear to be big gaps between monthly inflation figures in a global macro system).

Dealing with non-current data, non-concurrent data and gaps in data accounts for nearly all the engineering in position decision code.

Defining non-current more precisely - the smell of staleness

Of course whether something is current or non-current is very much in the eye of the beholder.

If you're a high frequency technical trader then a quote that is a few seconds stale is ancient history. A slow moving asset allocating investor might not be bothered by the fact that for example UK inflation came out two months ago.

Implicit in what is going on here is that you're going to take your data, and ultimately resample it to a frequency that makes sense for your trading system. Let's call that the 'natural frequency'. For a medium speed trend follower like myself, daily is fine. But that bucketing could be done on a microsecond level, ..., second, ..., minute, ...., hourly, ... and so on perhaps up to quarterly (that's four times a year, by the way, not four times a second/minute/hour/day).

Alternatively you sometimes hear traders talking in 'bars'. For example "I use 5 second bars" or one minute bars. Same idea; except of course they're looking at the Open/High/Low/Close within a bar rather than just a final value or average.

Anything that falls within the last time bucket at the natural frequency is effectively current. So if you're a fast day trader using a natural frequency of 5 seconds, and you haven't got an updated quote in the last 5 seconds, you would consider that to be a non-current price. But if you had a quote 1 second ago then that would be fine.

In my specific system I use daily data. So before I do anything I down sample my prices to a daily frequency. I'd only consider myself to be missing a price if I hadn't got anything since the start of the current day.

Quick definition:

Down sampling: to go from a high frequency data series to a lower frequency series, eg minute by minute data down to daily.

Note that down sampling can be done in different ways; the principal methods are to take an average over the slower time period, or to use the final value that occurred.

Up sampling: the reverse, eg go from monthly to daily.
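The two principal down sampling methods can be sketched as follows. This is an illustration using only the Python standard library, not the code I actually run; the function name and the tick values are made up.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean


def downsample_daily(ticks, how="last"):
    """ticks: list of (datetime, value) sorted by time. Returns {date: value}."""
    buckets = defaultdict(list)
    for stamp, value in ticks:
        buckets[stamp.date()].append(value)  # one bucket per calendar day
    if how == "last":
        return {day: values[-1] for day, values in buckets.items()}
    if how == "mean":
        return {day: mean(values) for day, values in buckets.items()}
    raise ValueError(f"unknown method: {how}")


ticks = [
    (datetime(2015, 6, 15, 9, 0), 100.0),
    (datetime(2015, 6, 15, 16, 30), 102.0),
    (datetime(2015, 6, 16, 10, 15), 101.0),
]
daily_final = downsample_daily(ticks, how="last")    # final value in each day
daily_average = downsample_daily(ticks, how="mean")  # average over each day
```

A day with no ticks at all simply has no entry in the result, which is the daily definition of a missing price.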

Non-concurrent

We need to be very careful how we resample and align data when we have multiple sources of data.

To take a pointless and trivial example, you could be using dividend yield as a data point (dividend in pence per year, divided by price in the same units). Assume for simplicity that dividends are annual, and that no reliable forecasts of dividends are available. It might be 11 months since the last dividend announcement was made.

On that day the price was say $10 and the dividend was $1, giving a yield of 10%. Is 10% really the most reliable and up to date value of the dividend yield? No; because we have the current price. It's normal practice to say that the (historic) dividend yield will be the last available dividend figure, divided by the most up to date price. We'd up sample the dividend figure to your natural frequency for price, say daily, and then forward fill it. Then we work out what the yield would be each day.
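A minimal sketch of that calculation, assuming the $10 price and $1 dividend from the example plus a made-up later price of $12.50; the helper name `forward_fill` is hypothetical. The point is that the dividend is filled forward and the yield is recomputed against each new price.

```python
from datetime import date


def forward_fill(observations, days):
    """observations: {date: value}. Returns the last known value on each day
    in days (None before the first observation)."""
    filled = {}
    last = None
    for day in days:
        if day in observations:
            last = observations[day]
        filled[day] = last
    return filled


dividends = {date(2014, 7, 1): 1.0}               # announced once a year
prices = {date(2014, 7, 1): 10.0,
          date(2015, 6, 1): 12.5}                 # made-up price 11 months later
days = sorted(prices)

dividend_filled = forward_fill(dividends, days)   # fill the dividend, not the yield
dividend_yield = {day: dividend_filled[day] / prices[day]
                  for day in days if dividend_filled[day] is not None}
```

Here the yield is 10% on the announcement day but 8% eleven months later, which forward filling the yield itself would have missed.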

Here is a more complicated example. Suppose we're trying to work out the real yield on a bond, which is equal to the nominal yield to maturity in % minus the inflation rate in %. Further suppose we're relatively slow traders for whom monthly data is fine.

Nominal yield to maturity is available any time we have a price from quotes or trades. For simplicity let's assume we're trading slowly, and using daily closing prices from which we will derive the yield. Inflation on the other hand comes out monthly in most advanced economies (with the exception of Australia, where it's quarterly; and Greece where most national statistics are just made up). For the sake of argument suppose that the inflation figure for May 2015 came out on 5th June 2015, June 2015 will come out on 5th July 2015 and so on.

If we looked at these two data series in daily data space, we'd see yields every day and then an inflation figure on the fifth of each month. In this case the right thing to do would be to first down sample the yields to a monthly frequency, using an average over the month. So we'd have an average yield for May 2015, June 2015, July 2015 and so on. We'd then compare the average yield for May 2015 to the inflation figure for the same month.

Note that the average real yield for May 2015 isn't actually available until 5th June. The published date in your data should reflect this, or your back testing will be using forward information.
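Here is a sketch of the real yield alignment, with the published date carried along. All the yield and inflation numbers are made up, and the data layout (dictionaries keyed by year and month) is just an assumption for illustration.

```python
from datetime import date
from statistics import mean

# Nominal yield to maturity in %, derived from daily closes (made-up numbers)
daily_yields = {
    date(2015, 5, 1): 2.0,
    date(2015, 5, 15): 2.2,
    date(2015, 5, 29): 2.4,
}
# Inflation in % for a month, with the date the figure actually came out
inflation = {(2015, 5): {"value": 0.5, "published": date(2015, 6, 5)}}


def monthly_average(series):
    """Down sample {date: value} to {(year, month): average value}."""
    buckets = {}
    for day, value in series.items():
        buckets.setdefault((day.year, day.month), []).append(value)
    return {month: mean(values) for month, values in buckets.items()}


def real_yields(daily_yields, inflation):
    """Average nominal yield minus inflation for the same month, stamped
    with the inflation publication date so a back test can't see it early."""
    result = {}
    for month, average_yield in monthly_average(daily_yields).items():
        if month in inflation:
            result[month] = {
                "value": average_yield - inflation[month]["value"],
                "published": inflation[month]["published"],
            }
    return result


may_real_yield = real_yields(daily_yields, inflation)[(2015, 5)]
```

A back test should only be allowed to use `may_real_yield` from its `published` date onwards.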

Let's take another example. Suppose you want the spread between two prices, for two different instruments. Suppose also that the spread isn't traded itself. You could, as I do here, use closing prices. Alternatively you could be starting with an intraday irregular series of prices from each instrument. The right thing to do would be to up sample them to some relatively high frequency; perhaps somewhere between microseconds and hours depending on the speed of your trading system and your tolerance for the accuracy of the spread.

You can then calculate the spread at the upsampled frequency, which should be quite a bit faster than you would really need. Depending on the nature of what you're doing this could be quite noisy (mismatched bid-ask bounce, different liquidity or delayed reaction to news for example). It's probably worth down sampling back to the frequency you actually want, and using an average when you do that to smooth out the noise.
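The up sample, take the spread, then down sample sequence might look like this sketch. Times are seconds since the start of a one minute bucket, the 5 second grid is an arbitrary choice, and the ticks and function name are invented for illustration.

```python
from statistics import mean


def upsample_ffill(ticks, grid):
    """ticks: sorted list of (seconds, price). Returns the last known price at
    each grid time (None until the first tick has arrived)."""
    out, last, i = [], None, 0
    for t in grid:
        while i < len(ticks) and ticks[i][0] <= t:
            last = ticks[i][1]
            i += 1
        out.append(last)
    return out


x_ticks = [(0, 100.0), (7, 100.5), (31, 101.0)]   # irregular prices for X
y_ticks = [(2, 98.0), (20, 98.5), (45, 99.0)]     # irregular prices for Y
grid = range(0, 60, 5)                            # common 5 second grid

x = upsample_ffill(x_ticks, grid)
y = upsample_ffill(y_ticks, grid)

# Spread at the fast frequency, skipping points where either leg is unknown
fast_spread = [a - b for a, b in zip(x, y) if a is not None and b is not None]

# Down sample back to one value for the minute, averaging to smooth the noise
minute_spread = mean(fast_spread)
```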

Here is another one. Imagine you want to compare the price of LIBOR futures with the underlying LIBOR spot rate. The former is available tick by tick during the whole of a trading day. The latter is 'fixed' (unfortunate choice of terminology, given what's happened) and published daily.

As you can see it's very easy to spot people fixing LIBOR because they carry their ill gotten gains around in big yellow bags with £ signs on.

In this case we only know the 'true' value of the futures - spot spread at a single moment. The correct approach is to resample the futures prices, getting each day a price as close as possible to the 11am fixing time.

Then day by day you compute the spread between the fix and the 11am futures price. In our database that spread should have a published time stamp of when the actual fix came out (normally around 11:45am).
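As a sketch of that resampling, assuming one day of futures ticks and a made-up fix: the function names and numbers are hypothetical, and the futures price is converted to a rate using the usual quoting convention for short term interest rate futures (price = 100 minus the implied rate).

```python
from datetime import datetime, time

FIX_TIME = time(11, 0)        # fixing time
PUBLISH_TIME = time(11, 45)   # roughly when the fix is published


def price_nearest(ticks, target=FIX_TIME):
    """ticks: list of (datetime, price) for one trading day. Returns the price
    whose time stamp is closest to the target time of day."""
    def seconds_away(tick):
        stamp = tick[0]
        return abs((stamp - datetime.combine(stamp.date(), target)).total_seconds())
    return min(ticks, key=seconds_away)[1]


def futures_spot_spread(ticks, fix_rate):
    """Futures implied rate at about 11am minus the spot fix, both in %."""
    day = ticks[0][0].date()
    implied_rate = 100.0 - price_nearest(ticks)
    return {"value": implied_rate - fix_rate,
            "published": datetime.combine(day, PUBLISH_TIME)}


ticks = [(datetime(2015, 6, 15, 10, 58, 30), 99.40),
         (datetime(2015, 6, 15, 11, 0, 20), 99.41),
         (datetime(2015, 6, 15, 11, 3, 0), 99.43)]
record = futures_spot_spread(ticks, fix_rate=0.57)
```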

Dealing with non-current data

Okay, so we've got one or more non-current data items. It might be that we're missing the last price, or that it's been a couple of weeks since we had a figure for real yields (for a system whose natural frequency is daily), or that another market which we need a price from is closed today.

No position

A very firm stance to take on the issue of non-current data is to have no position at all if you're missing any data whatsoever. If you're a high frequency day trader who needs to see the entire order book to decide what position to have, and it's stuffed full of stale quotes, then it's probably reasonable to do nothing for a while.

However for the vast majority of people this is madness, and would lead to extremely high trading costs if you closed your position every time you were missing one lil' bit of data.

Forward filling

The simplest, and probably most common, approach is to forward fill - take the last known good value of your data thing and use that. Philosophically we're saying that we can't observe data continuously, and the best forecast of the current unknown value of the data is the last value we saw.

It can sometimes be useful to have a concept of maximum staleness. So suppose you're trading using monthly inflation data at a natural frequency of daily. It will be normal to have to forward fill this for a few weeks. On the other hand if the inflation data didn't arrive when you last expected it, and you're exceptionally having to forward fill for 2 months, then you might want to treat that differently and use one of the other methods.
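A minimal sketch of forward filling with a maximum staleness check. The function name, the 45 day limit and the inflation figure are all assumptions chosen for illustration; what to do once the value is flagged as stale is left to the caller.

```python
from datetime import date, timedelta


def latest_value(observations, today, max_staleness):
    """observations: {date: value}. Returns (value, is_stale), where value is
    the last observation on or before today, forward filled."""
    known = [day for day in observations if day <= today]
    if not known:
        return None, True                 # nothing to forward fill from
    last_day = max(known)
    is_stale = (today - last_day) > max_staleness
    return observations[last_day], is_stale


inflation = {date(2015, 6, 5): 0.5}       # made-up monthly figure
limit = timedelta(days=45)

normal = latest_value(inflation, date(2015, 6, 25), limit)    # a few weeks old
overdue = latest_value(inflation, date(2015, 8, 10), limit)   # about two months old
```

In the second case the value is still returned, but the flag tells you to consider one of the other methods.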

Be careful about premature or incorrect forward filling, which can lead to incorrect calculations across multiple data series. Returning to the very simple example of a dividend yield calculation, forward filling the dividend yield rather than the dividend would have given us the wrong answer.

Extrapolation

We might be able to do better at forecasting the current value of an unobserved variable by extrapolating from the past data we have.

Extrapolation could involve fitting a complex time series model of the data and working out the next likely point, or something as simple as 'drawing a line' through the last couple of points and extending it.
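The 'drawing a line' version is simple enough to sketch; the function name and numbers are made up, and a proper time series model would replace the straight line.

```python
def extrapolate_linear(times, values, target_time):
    """Extend a straight line through the last two observations to target_time."""
    (t0, v0), (t1, v1) = list(zip(times, values))[-2:]
    slope = (v1 - v0) / (t1 - t0)
    return v1 + slope * (target_time - t1)


# Values seen on days 1 to 3; estimate the unobserved value for day 4
estimate = extrapolate_linear([1, 2, 3], [100.0, 101.0, 103.0], target_time=4)
```

Only the last two points matter here, so one noisy observation will swing the estimate a long way.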

Temporary de-weighting

This is only applicable where you've got more than one data item.

I'll need to briefly explain this. Imagine you've got a system where your optimal position is a function which depends on a linear weighted average of some other functions, each using a different set of predictive data.

So you might have something like

position_t = f [ A x g (data_1_t) + B x h (data_2_t) ]

... where f,g,h are functions, A and B are weights, and data are boringly data. Now what if one of those bits of data is missing?

One option is to temporarily set the weight (A or B) to zero. With missing data you are effectively saying you have less confidence in your position, so it's reasonable to make it smaller to reflect that. The disadvantage of this approach is the extra trading it generates.

Note that this isn't the same as 'if a piece of data is missing, set it to zero'. That would lead to very whacky results, if for example you have a long series of prices of 100, 101, 102, ... and then a zero.
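Here is a sketch of de-weighting for the weighted sum inside f. It assumes g and h have already been evaluated, with `None` standing for a term whose data is missing; the weights of 0.5 and the forecast values are example numbers only.

```python
def deweighted_sum(forecasts, weights):
    """forecasts: outputs of g(data_1), h(data_2), ... or None if the data is
    missing. A missing term has its weight set to zero, so it drops out."""
    return sum(weight * forecast
               for forecast, weight in zip(forecasts, weights)
               if forecast is not None)


weights = [0.5, 0.5]                             # A and B
both_present = deweighted_sum([10.0, 20.0], weights)
second_missing = deweighted_sum([10.0, None], weights)
```

The weight is zeroed, not the data: the surviving term still sees its own inputs untouched, and the combined value simply gets smaller.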

Temporary re-weighting

Again this is only applicable where you've got more than one data item.

Let's return to the last equation again:

position_t = f [ A x g (data_1_t) + B x h (data_2_t) ]

Suppose that normally A and B are both 0.50.

Rather than deweighting and getting a smaller position by setting B to zero when data_2 is missing, we could get data_1 to 'take up the slack'. In this case we'd set B to zero, but then inflate A to 1.0.

This method produces extra trading, but not as much as full de-weighting.
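And a sketch of re-weighting, under the same assumptions: g and h already evaluated, `None` for a term with missing data, made-up forecast values. The surviving weights are scaled up so they add up to the original total.

```python
def reweighted_sum(forecasts, weights):
    """Drop terms whose forecast is None and inflate the remaining weights so
    that they sum to the same total as the original weights."""
    available = [(forecast, weight)
                 for forecast, weight in zip(forecasts, weights)
                 if forecast is not None]
    remaining = sum(weight for _, weight in available)
    if remaining == 0:
        return None                      # nothing left to take up the slack
    total = sum(weights)
    return sum(forecast * weight * total / remaining
               for forecast, weight in available)


weights = [0.5, 0.5]                     # A and B
both_present = reweighted_sum([10.0, 20.0], weights)
second_missing = reweighted_sum([10.0, None], weights)   # A is inflated to 1.0
```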

Inference

It's sometimes possible to infer what a missing data value is from values of other data items that we do have. So for example you might have a series of interest rates of different maturities; and occasionally you find that the 7 year point isn't available.

Given some model of how interest rates are linked you'll probably be able to do a pretty decent job of inferring what the 7 year rate is.
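As a sketch, take the crudest possible 'model': a straight line between the neighbouring maturities. That choice is purely an assumption for illustration (a real curve model would be richer), and the rates are made up.

```python
def infer_rate(curve, maturity):
    """curve: {maturity in years: rate in %}. Linearly interpolate the rate at
    a missing maturity from its nearest neighbours on either side."""
    below = max(m for m in curve if m < maturity)
    above = min(m for m in curve if m > maturity)
    weight = (maturity - below) / (above - below)
    return curve[below] + weight * (curve[above] - curve[below])


curve = {2: 1.0, 5: 1.5, 10: 2.0}        # the 7 year point is missing today
seven_year = infer_rate(curve, 7)
```

This raises a `ValueError` if there is no point on one side of the missing maturity, in which case you are extrapolating rather than inferring.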

I have no idea what this picture means. But it looks pretty.

Missing data

Non-current data is effectively a special case of a broader problem: missing data. We can also have missing data which isn't current, eg historical data. Missing historical data can be for example a missing day in a series of closing prices.

Generally you can adopt the same approaches - no position, deweighting, reweighting, forward filling and extrapolation. Depending on exactly what you're doing you could also use interpolation of course. So if for example you have a missing price one day in your history, you could do something simple like a linear interpolation, or use a Brownian bridge if you were worried about understating volatility.

Just be careful you don't end up back testing using forward looking information that wasn't available when trades were done.
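A sketch of the simple linear interpolation case, with a made-up price series and function name. Each filled value depends on the next known price, which is exactly the kind of forward looking information to keep away from anything a back test treats as known on the day.

```python
def interpolate_gaps(values):
    """values: list of floats with None for missing points. Fills gaps that
    have a known value on both sides; leading and trailing gaps stay None."""
    filled = list(values)
    known = [i for i, value in enumerate(values) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


closes = [None, 100.0, None, None, 103.0, None]
cleaned = interpolate_gaps(closes)
```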

Event driven systems - any better?

One group of people will have been laughing at this post up to now: those using event driven systems (but then they generally laugh at everything I write).

With an event driven system you wait for your data to arrive and then you act on it. You might think therefore that you are immune from the missing data issue. No data - no action. This is quite different from the paradigm I'm generally working with; checking now and then to see what the latest data is, and then based on that deciding what to do.

The hilarity of the event driven people is probably fair, but with two big enormous caveats. Firstly they should still be concerned with non-current stale data. If you're operating some whacky intra day system with an expected holding period of five minutes where you look for a particular pattern to enter and another to exit, what happens if you've had your position for ten minutes and you're for whatever reason not getting quotes updated?

You may pooh pooh the chances of this happening but automated system engineering is all about dealing with the unexpected.

Secondly if you're operating on multiple data then it's perfectly possible to have concurrence problems; a price might arrive triggering an event but if your function also relies on another piece of data you still need to worry about whether that is stale or not.

So - stop giggling at the back. You've got nothing to be smug about.

Outliers

There is one more set of engineering that needs to be done in position generation code; and that is dealing with outliers. Naturally you'll have screened your outliers to ensure they are genuine, and not bad data.

You can get genuine outliers in both single data series, and also in multiple series.

Let's take a single data series example. Suppose you're calculating volatility from a series of daily returns. The last daily return is.... rather large.

Is it reasonable to use the entire return unadjusted in your calculation, resulting in a big spike in volatility? Or should we limit the size of the return to avoid biasing the value we get?

In this situation there is also the option of using a more robust calculation. So for example you could use an interquartile range rather than a standard deviation; you can use medians rather than means, and so on.
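Both options, limiting the return and using a more robust statistic, can be sketched with the standard library. The returns (in %) and the limit of 1.0 are made up, and where to set the limit is the policy decision this section is about.

```python
from statistics import quantiles, stdev


def clipped(returns, limit):
    """Cap each return at plus or minus limit before using it."""
    return [max(-limit, min(limit, r)) for r in returns]


def interquartile_range(returns):
    """Gap between the 75th and 25th percentiles; barely moved by one outlier."""
    q1, _, q3 = quantiles(returns, n=4)
    return q3 - q1


daily_returns = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, 8.0]   # last one is rather large

raw_vol = stdev(daily_returns)                        # spikes because of the 8.0
capped_vol = stdev(clipped(daily_returns, limit=1.0))
robust_range = interquartile_range(daily_returns)
```

An interquartile range is not on the same scale as a standard deviation, so it needs rescaling before it replaces one in a position calculation.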

Here's an example from multiple series. This time we're calculating an index of industry returns from individual equity prices. One firm has a big return, pushing the whole index up. Again do we live with this, or take action?

There is no right or wrong way to handle these, and the definition of 'outlier' (the point at which a big point becomes too extreme) is situation specific and down to preference. However you do need to consider them and have a policy for dealing with them.

Done

Congratulations. You can now build a slightly safer trading algo than you could when you started reading this. Now for the easy bit - designing one that will consistently make money :-)

This is the third post in a series on building systematic trading systems.

The first two posts are:

http://qoppac.blogspot.co.uk/2015/04/system-building-data-capture.html
http://qoppac.blogspot.co.uk/2015/05/systems-building-futures-rolling.html

The next post is here.

17 comments:

Sunyin, 16 June 2015 at 20:27
Great post, keep up the detailed insights into all the challenges aside from the usual focus on alpha generating signals.
Alex Shestakov, 26 June 2015 at 11:44
Hi Rob, I think you meant "minutely", not "monthly" in up-sampling definition.
"Up sampling: The reverse. Eg go from daily to monthly data."
Anonymous, 21 November 2016 at 18:39
Hi Rob,

How do you execute the fx hedge? Is there a specific order type that does it simultaneously or is it a separate module that takes the sizes of the positions that need to be hedged and execute the hedge separately?

Many thanks.
Anonymous, 28 November 2016 at 00:15
My apologies for the delay(the whole Thanksgiving thing)!. Thank you much for the response.

Would mind just running through a quick example of how this works? I am obviously USD based. For example, if I wanted to buy $100 USD worth of something denominated in GBP and the margin is, say, 5%. I would convert 5 USD to GBP(lets assume 1-to-1 fx rate). Then buy the GBP future. So then the only FX exposure I have if the 5 USD(GBP) and plus, or minus, my PnL? Am I thinking of this correctly?

Maybe you have a clearer example?

Thank you so much.
Anonymous, 5 December 2016 at 05:32
Hi Rob,

For cross-sectional strategies how do you weigh them? Linearly((n+1)/2 and then scale sum to +1 long and -1 short) or quantiles(eg. equally long top 20% and equally short bottom 20%).

Why do you choose one versus the other? And I guess can you diversify by doing both?

I tend to prefer the linear weighting system since it eliminates a parameters(eg which quantiles to choose).