Friday, 25 September 2015

A story of poor statistical intuition

In continuing my futile quest to raise the level of debate in the quantitative investment community I thought I'd have a go at another clever and very wealthy guy, Cliff Asness, founder of giant fund AQR.

Cliff Asness. There is nothing wrong with his statistical intuition. Or his suit. Nice suit Cliff.
To be precise I'm not having a go at Cliff (at some point these billionaires are going to start lawyering up) but at the authors of this specific paper:

And, to disappoint you further, I'm not really going to 'have a go' at the authors of the paper.

I actually agree completely and wholeheartedly with the sentiments and the conclusions of the paper. I also know that the authors have done and also published other, very good, research which supports the results.

What I have a tiny little problem with is the way that the analysis is presented. More specifically it is about the interaction of a particular human weakness with a particular way of presenting data.

I should also say upfront that this is a problem by no means limited to this paper. It is endemic in the investment industry; and much worse beyond it*. It just so happens that I was sent this paper by an ex-colleague of mine very recently, and it got me a bit riled.

* Here's a great book on the subject of misrepresenting scientific statistical evidence which is worth reading.

So this post is not a criticism of AQR in particular; and I should reiterate that I have a huge amount of respect for their research and what they do generally.

What does the paper say

If you are too idle to read the paper it looks at the quarterly returns of a bunch of assets and style factors, conditioned on there being a terrible quarter in either the bond or equity markets (they also do some stuff with overlapping 12 month returns). For example if you skip to exhibit 1 it shows that for the 10 worst quarters for equities since 1972 fixed income made money in 8 out of 10.

The ex colleague of mine who sent me the paper made the following comment:

"Hello Robert,

Have you seen the paper attached yet? It is interesting that global bonds had a decent performance in the 10 worst quarters for global stocks in 1990-2014, but not the other way around... Trend following seems to have good performance when either stocks or bonds suffer."*

* I've decided to leave my friend anonymous, although he has kindly given permission for me to use this quote.

My first thought was "hmm... yes that is interesting". 

Then after a few minutes I had a second thought:

"Hang on. There are only 10 observations. Is this even statistically significant?"
This was a serious problem. More so because the authors of the paper had also highlighted a key finding, which relates to something I talked about in my last post, trend following. From the paper:

"Trend was on average profitable in all asset classes returns during these equity tail events... As noted, Trend has often performed well in the worst equity quarters... Trend has been a surprisingly good equity tail hedge for more than a century"

I stared at the numbers, but I still couldn't decide whether they were meaningful or not. The underlying problem here is that humans are rubbish at intuitively judging statistical significance - even ones like my friend and me who actually understand the concept.

A bit of small sample statistics

Before proceeding let me briefly explain my outburst on statistical significance. Those who, unlike me, saw immediately whether the results were statistically significant or not can smugly skip ahead.

If we abstract away from the specifics we can reword my friend's statement as follows:

"Hypothesis 1: The average return for Bonds, conditional on poor returns for Equities, is positive."
"Hypothesis 2: The average return for Equities, conditional on poor returns for Bonds, is negative"
"Hypothesis 3: The average return for Trend, conditional on poor returns for Equities, is positive"
"Hypothesis 4: The average return for Trend, conditional on poor returns for Bonds, is positive"

The third hypothesis is also one of the main points the authors flagged up.

(By the way, and this is just a small criticism, it might have been more intuitive if the graphs in the paper had been done with Sharpe Ratio units rather than mean returns on the y axis; although the authors do quote the Sharpe Ratios in the heading. Specifically it would probably have made sense to normalise the quoted returns by the volatility of the full sample.

However I guess what is more intuitive for me might not be to many other people; so I can live with this.)

Notice that in each of the four hypotheses we have an asset we're trying to predict returns for, and another asset that we are conditioning on. We can abstract further to avoid having to model the joint distribution in an explicit way (you can do this of course, but it would take longer to explain):

"Hypothesis 1: The average return for Bonds, in scenario X, is positive."
"Hypothesis 2: The average return for Equities, in scenario Y, is negative"
"Hypothesis 3: The average return for Trend, in scenario X, is positive" 
"Hypothesis 4: The average return for Trend, in scenario Y, is positive"

Obviously Scenario X is bad equity returns, and scenario Y is poor bond returns. The next thing we need to think about is what econometricians would call the data generating process (DGP). This isn't so much where the data is coming from, but where we pretend it's coming from.

We'll treat scenario X and scenario Y individually. Scenario X then consists of a sample of 10 returns drawn from a much larger population which we can't see. Scenario Y is another 10 returns drawn from a different population. The sample mean return for bonds in X is +3.9%; and for equities in Y is -2.7% (from exhibits 1 and 3 respectively). For Trend it's 6.4% for X, and 3% for Y.

I'm also going to assume that the underlying population is Gaussian, with some unknown mean, but with a standard deviation equal to the sample standard deviation*; which for bonds (X) is about 3% a quarter, for equities (Y) around 5.2% a quarter, and for Trend 7% (X) and 5.3% (Y). This is all a little unrealistic, but again it would be more complicated to do it another way, and it doesn't change the core message.

* Interestingly the full period standard deviation for bonds is 2.6%** a quarter, equities 6.95%, and trend 5%. Risk seems to be a little higher than normal in equity crises, but not so much when bonds are selling off.

** derived from annualised figures assuming no autocorrelation between quarterly returns

Looking at my hypotheses, the null I'm trying to disprove in each case is that the true population mean return is zero (I could do it other ways, but this is simpler). So let me generate by repeated randomness the distribution of the sample mean statistic for 10 observations, given the estimated standard deviation:

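In Python:

import numpy as np
import random as rnd
import matplotlib.pyplot as plt

# Conditional standard deviations and sample means, in % per quarter
stdev_dict=dict(BOND=3.0, EQUITY=5.2, TRENDX=7, TRENDY=5.3)
sample_mean_dict=dict(EQUITY=-2.7, BOND=3.9, TRENDX=6.4, TRENDY=3.0)

assetname="EQUITY"  # or "BOND", "TRENDX", "TRENDY"
sample_size=10  # the ten worst quarters
monte_carlo=100000  # number of simulated samples
mean=0.0  # null hypothesis: the true population mean is zero
stdev=stdev_dict[assetname]
estimate_mean=sample_mean_dict[assetname]

# Distribution of the sample mean under the null
ans=[np.mean([rnd.gauss(mean, stdev) for x in range(sample_size)]) for unused_idx in range(monte_carlo)]

# One sided p-value: share of simulated means more extreme than the estimate
if assetname in ["BOND", "TRENDX", "TRENDY"]:
    p_value=float(len([x for x in ans if x>estimate_mean]))/monte_carlo
elif assetname=="EQUITY":
    p_value=float(len([x for x in ans if x<estimate_mean]))/monte_carlo
else:
    raise Exception("unknown assetname %s" % assetname)

# Histogram of simulated means, with an arrow at the estimate labelled with the p-value
fig, ax2=plt.subplots()
thing=ax2.hist(ans, 50)
ax2.annotate("%.4f" % p_value, xy=(estimate_mean, 0),xytext=(estimate_mean,max(thing[0])), arrowprops=dict(facecolor='black', shrink=0.05))
plt.show()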

First for equities:

Some explanation. The blue histogram shows the distribution of the sample mean, over each of the 100,000 draws of 10 returns. The arrow shows the estimate from exhibit 3, -2.7%. The number above the arrow shows what proportion of the distribution is to the left of our estimate. So if average equity returns, conditional on poor bond returns, really were zero, there would still be a 5.09% chance of drawing a set of ten returns with an average at least this poor.

This result is just shy of significance, using the normal 5% criterion. We can't reject the null hypothesis.

Then for bonds:

Here we have a different story. There is almost no chance that the average return of 3.9% has come from a population with mean zero. We can reject the null. Bonds are a good hedge for equities when things get really bad.

Now for Trend, conditioned on poor equity returns:

This is very significant. Trend is a great equity hedge. Let's see if it is any good for bonds:

Not quite as good, but just creeps into being significant at the 5% level. In summary:

"Hypothesis 1: The average return for Bonds, conditional on poor returns for Equities, is positive." - we can say this is very likely to be true.

"Hypothesis 2: The average return for Equities, conditional on poor returns for Bonds, is negative" - we cannot say if this is true or not.

"Hypothesis 3: The average return for Trend, conditional on poor returns for Equities, is positive" - we can say this is quite likely to be true

"Hypothesis 4: The average return for Trend, conditional on poor returns for Bonds, is positive" - we can say this is probably true

So my friend was mostly right; 3 out of 4 is pretty good. AQR were spot on; in fact the key findings they highlighted were hypotheses 1 and 3, the most highly significant ones. What's more my own personal feelings about allocating to trend following are still justified. However it has taken a fair bit of work to prove this!
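As a sanity check on the simulation, and under exactly the same assumptions (Gaussian population, a standard deviation treated as known and equal to the conditional sample estimate, and a null mean of zero), the sample mean of 10 draws has a standard error of the standard deviation divided by the square root of 10, so the one sided p-value can be read straight off the normal distribution. This is only a sketch: the helper names `normal_cdf` and `one_sided_p_value` are mine for illustration, and the inputs are just the figures quoted above.

```python
import math

# Figures quoted in the post, in % per quarter; 10 tail quarters in each scenario
stdev_dict = dict(BOND=3.0, EQUITY=5.2, TRENDX=7.0, TRENDY=5.3)
sample_mean_dict = dict(BOND=3.9, EQUITY=-2.7, TRENDX=6.4, TRENDY=3.0)
SAMPLE_SIZE = 10


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_p_value(sample_mean, stdev, sample_size=SAMPLE_SIZE):
    """Chance of a sample mean at least this far from zero (same direction),
    if the true mean is zero and stdev is treated as known (an assumption)."""
    standard_error = stdev / math.sqrt(sample_size)
    z = abs(sample_mean) / standard_error
    return 1.0 - normal_cdf(z)


analytic_p_values = {
    name: one_sided_p_value(sample_mean_dict[name], stdev_dict[name])
    for name in stdev_dict
}

for name, p in analytic_p_values.items():
    print("%-7s p-value %.4f" % (name, p))
```

The numbers should land close to the simulated ones, with small differences down to Monte Carlo noise. Treating the standard deviation as known is a simplification; using a t distribution with 9 degrees of freedom instead would give somewhat larger p-values.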

Why we have to tell stories to explain stuff

The crux of the problem is that it's really, really hard to judge what is significant or not in small samples. Most people don't carry around an intuition about these distributions in their heads. But using small samples is quite common in papers like these. The reason is a flaw in the human brain, a cognitive bias known as the narrative fallacy. Or to put it another way, we like to hear stories.

If I show you a mass of data points you will probably be thinking 'yeah fascinating. Now what Rob?'. But if I show you a nice graph as in exhibit 1 of the AQR report you'll be thinking '4Q 1987. Black Monday! Ah I remember that / I've read about that (delete depending on age)...'.

1987 crash. Yes children it's true. In the olden days traders wore suits and ties; monitors were really, really big; and the only colour they could display was green.

The information becomes more interesting. Clever researchers know this, and so present information in a way which makes it easier to hang a narrative off.

Why this is bad

This is bad because a story can be both unrepresentative and also statistically meaningless. If I show you a story about an aircraft crashing you are more likely to avoid flying, even if I subsequently show you some dry statistics on the relative safety of different kinds of transport.

A sample of one.

Stories, or if you prefer small samples, can lead us to the wrong judgement*.

* I'm aware that 'you can prove anything with statistics'. However it's true to say that a rigorous analysis of a large sample set, properly presented, is always going to be more meaningful than the inferences from a small sample.

Sometimes this is deliberate; as in most tabloid newspapers reporting on medical research. Sometimes it's accidental.

Of course it might be that the small sample is statistically significant, in which case we can draw a conclusion about the general population, as in the case of three out of the four hypotheses we've tested.

However if I see a paper with some small sample results in it, but no indication of significance I don't know if:

  • The authors have deliberately shown an unrepresentative and insignificant sample, and the results are wrong
  • The authors have got an unrepresentative and insignificant sample by accident, haven't realised it and the conclusions are wrong
  • The authors have got a representative sample, but not a significant one. We can't prove the conclusion either way.
  • The authors have got a significant and representative sample (the authors may, or may not realise this. I expect the AQR authors did know. These guys aren't sloppy). The authors are correct, but I have no way of knowing this.
Thus there is no way to separate out malicious intent, incompetence and good research which happens to be missing the information needed to understand the significance (as we have here). You need to tell people the values are significant. They won't work it out for themselves (even if they know how, as I do, it's far too much work to generate distributions of sample estimators for every number in every paper I read).

It's for this reason that academic papers are littered with p-values and other statistics (though that doesn't mean you can trust them entirely). I'm not saying that a 'popular finance' paper like this should be festooned with statistical confetti. But a footnote would have been nice.


Don't be afraid of explaining the uncertainty in estimates. Talk about it. Explain it. Let people visualise it. And if you have got significant results, shout about it.

If you're worried that this blog is going to continue in this vein (criticising the research findings of hedge fund billionaires), don't worry. Next time I'll talk about something dull and worthy, like estimating transaction costs; or I'll give you some thrilling python code to read.

But if you're only now following me in the expectation that I'll be writing a post next week about David Shaw's inability to do stochastic calculus, or Ray Dalio's insistence on assuming returns are Gaussian, then I'm sorry, you will be disappointed (and if their lawyers are reading, neither of those things is true).


  1. I've never been a fan of formal significance testing for validating trading strategies or similar. Any attempts to test for statistical significance are flawed because if you look closely enough every assumption they require will be violated in some way (stationary, normally distributed, independent, etc...).

    At some point you have to sit down and just ask yourself whether you truly believe your findings will persist into the future.

    1. I have to say I agree with you. I wouldn't run anything I didn't believe in no matter how significant, and I certainly don't do this level of testing for my own stuff.

      However I think when you're putting something out there for other people to read, you have a bit more responsibility to make sure they interpret it properly.

  2. Heh. I find it funny that just as I conclude a series of posts about validating every step of a trading strategy development with statistical testing, you come out with a post on it as well. Funny, that.

  3. Rob: I like the approach (of course, assuming gaussian might be questionable) but when you use the standard deviations in your simulation, I think you're using the unconditional ones when you should be using the sample standard deviation conditional on the other thing occurring (which is going to be very difficult to estimate). In other words, I think you should be using the standard deviations of the joint distribution because your standard deviation estimate does not take into account that the other event occurred.

    Definitely what you're doing is tricky so I may be misunderstanding. Thanks for the interesting post. It definitely got me thinking more about Tom Sawyer's comment about statistics. I forget it exactly but it's something to the effect that statistics can lie.

    1. I'm actually using the conditional standard deviation; which I'm taking by using a population estimate based on the 10 known draws. I just quoted the unconditional out of interest. Sorry I clearly didn't make that clear enough.

      Yes Gaussian is an approximation. But with only 10 observations from the tail it's really hard to think how we could get any meaningful estimate of skew and kurtosis of the tail distribution.

      I think the fact that the conditional std. dev. in the tails is higher than the unconditional indicates that the overall distribution is non Gaussian ("like, duh" as Bart Simpson might say); assuming the tails are drawn from a different distribution to the body with a higher std. dev. is probably going to give you similar results to modelling the whole distribution using a non gaussian, or a non parametric approach.

      Of course in this case we don't have access to the whole distribution, so it's a moot point.

  4. Hi Rob. I didn't realize that you were using that standard deviation. Thanks for the clarification.

    One way to avoid the gaussian assumption would be through bootstrapping. I don't know if you want to be bothered but it's just a heads up that there is a way around it. Even bootstrapping probably has some requirements about the "size of the population" you're sampling from so I'm not sure if the fact that you have only 10 observations would affect the bootstrapping results. I'd have to look at Efron's bible and I don't have the time at the moment. Still, a nice article that shows how careful one has to be when reading statistical results.


  5. Rob, thanks for these posts, very informative.

    I'm in the process of backtesting a strategy and have a distribution of returns.
    I'd like to get some statistics on drawdown occurring since the strategy inception, with a view to being able to say something like there is 10% probability of drawing down 20% in the first month.

    As I have this return series, how much sense does it make to create multiple returns series from this distribution via bootstrapping and look at the statistics over these sampled distributions? Is this a common technique? Are there any pitfalls to look out for?

    Unfortunately there is another consideration in that this is a weekly strategy (out of position each week) and I only have approx. 200 returns.

    1. Yes this approach is fine. Only 200 points is okay; that's the whole point of bootstrapping: to get more value out of limited data.

      The main pitfall is that you will lose any time series dependence (autocorrelation of returns), which you can avoid by block sampling consecutive series of N returns with an appropriate length. This will reduce the effective number of data points you have, so don't make N too big. Maybe N=26 for weekly is okay? (6 months)

      The alternative is to generate random data with the properties you want as here:
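For the block sampling approach described in the reply above (as distinct from generating random data), a minimal sketch might look like the following. Everything here is an assumption for illustration: the function names are mine, the weekly returns are simulated placeholders standing in for a real backtest, and the 20% threshold, 52 week horizon and N=26 block length are just example settings.

```python
import numpy as np


def max_drawdown(returns):
    """Largest peak-to-trough fall of the compounded equity curve, as a fraction."""
    equity = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    # Include the starting value of 1.0 so a loss in the first period counts
    equity = np.concatenate(([1.0], equity))
    running_peak = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / running_peak))


def block_bootstrap_path(returns, path_length, block_length, rng):
    """Build one synthetic path by stitching together randomly chosen blocks
    of consecutive returns, which keeps the dependence within each block."""
    returns = np.asarray(returns, dtype=float)
    n_blocks = -(-path_length // block_length)  # ceiling division
    starts = rng.integers(0, len(returns) - block_length + 1, size=n_blocks)
    blocks = [returns[s:s + block_length] for s in starts]
    return np.concatenate(blocks)[:path_length]


def drawdown_probability(returns, threshold, path_length, block_length,
                         n_paths=5000, seed=0):
    """Share of bootstrapped paths whose maximum drawdown reaches the threshold."""
    rng = np.random.default_rng(seed)
    hits = sum(
        max_drawdown(block_bootstrap_path(returns, path_length, block_length, rng))
        >= threshold
        for _ in range(n_paths)
    )
    return hits / n_paths


# PLACEHOLDER data: 200 simulated weekly returns, not a real strategy
placeholder_rng = np.random.default_rng(42)
weekly_returns = placeholder_rng.normal(0.002, 0.03, size=200)

prob = drawdown_probability(weekly_returns, threshold=0.20,
                            path_length=52, block_length=26)
print("Estimated P(max drawdown >= 20%% in first year): %.3f" % prob)
```

Setting `block_length=1` reduces this to an ordinary bootstrap that ignores autocorrelation. For a shorter horizon, such as the first month, `path_length` can be set to 4; each path is then a single block truncated to four weeks.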