Predictive Analytics

Overfitting II: Out-of-Sample Testing

Editor SDC

Previously I wrote a note on overfitting during training. Now, with that in mind, let's imagine a typical scenario:

You're trying to find a strategy with an edge, and you're considering three types: a moving-average-crossover momentum strategy, an RSI-threshold strategy, and a buy-after-gap-down strategy. Being a modern quant trader, you know that regular, automatic parameter optimization is the only way to build an adaptive, fully automated system. The goal of system development, of course, is to determine which strategy is best.

After reading the previous note on overfitting, you're smart enough to have split your data into two sets: one for training and one for testing.

The training set is used with cross-validation to find the best parameters for each strategy. You [separately] have it automatically optimize the two moving-average lengths, the RSI period, and the minimum downward-gap threshold. Those are the obvious parameters. Then the out-of-sample test set is used to measure the performance of each strategy, generating PnL, max drawdown, Sharpe ratio, etc.
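
As a rough illustration of this workflow (not the author's code): the sketch below generates a synthetic price series, runs a plain grid search on the training set as a stand-in for the cross-validation loop, tunes only the moving-average strategy's two lengths, and then touches the out-of-sample set once to report PnL and Sharpe. The function names, parameter grid, and 75/25 split are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic daily prices; in practice this would be your historical data.
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 2000))))
train, test = prices.iloc[:1500], prices.iloc[1500:]   # in-sample / out-of-sample

def ma_signal(p, fast, slow):
    """Long (1) when the fast moving average is above the slow one, else flat (0)."""
    return (p.rolling(fast).mean() > p.rolling(slow).mean()).astype(int)

def pnl(p, signal):
    """Total log return from holding whenever yesterday's signal was 1 (no costs)."""
    rets = np.log(p).diff().fillna(0.0)
    return float((signal.shift(1).fillna(0) * rets).sum())

def sharpe(p, signal):
    rets = np.log(p).diff().fillna(0.0) * signal.shift(1).fillna(0)
    return float(np.sqrt(252) * rets.mean() / (rets.std() + 1e-12))

# 1) Parameter optimization on the training set only (a grid search standing in
#    for the cross-validation loop described above).
grid = [(f, s) for f in (5, 10, 20) for s in (50, 100, 200)]
best_params = max(grid, key=lambda fs: pnl(train, ma_signal(train, *fs)))

# 2) The test set is used once, to report out-of-sample performance.
oos = ma_signal(test, *best_params)
print(best_params, pnl(test, oos), sharpe(test, oos))
```

The same tuning would be repeated, separately, for the RSI period and the gap threshold of the other two strategies.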

Following this, you compare the results and, based on the PnL curves and careful scrutiny, pick the best system.

What was the problem in the above? Considering three strategies introduced a hidden parameter that slipped past cross-validation. Go back and imagine one bigger system that holds a portfolio of strategies: MA, RSI, and gap-based, numbered 1, 2, 3. This system has an extra parameter s ∈ {1, 2, 3}, on top of the per-strategy parameters mentioned above. When this bigger system reaches the cross-validation loop, one final result pops out. Previously we had three results and then we chose the best.
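
To make the hidden parameter concrete, here is a small self-contained sketch (everything in it is hypothetical): the three "tuned" strategies are stood in for by random long/flat signals, and "picking the best" out of sample is literally an argmax over s evaluated on the test data.

```python
import numpy as np

rng = np.random.default_rng(1)
test_rets = rng.normal(0, 0.01, 250)   # stand-in for out-of-sample daily returns (pure noise)

# Random long/flat signals standing in for the MA, RSI, and gap systems
# after their training-set optimization; s is the hidden strategy index.
signals = {s: rng.integers(0, 2, 250) for s in (1, 2, 3)}

oos_pnl = {s: float((sig * test_rets).sum()) for s, sig in signals.items()}
s_star = max(oos_pnl, key=oos_pnl.get)   # choosing the winner optimizes s on the test set
print(oos_pnl, "selected s =", s_star)
```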

This is equivalent to overfitting on the training data. Convince yourself of this fact. The two cases appear different only because of the different purposes and names we assigned to the 'training' and 'test' sets; in fact, picking a model at the end was itself a form of training. Now generalize this equivalence between overfitting the training set and overfitting the test set to cases where the system follows a more complex adaptive strategy, with layers upon layers of auto-optimization and validation loops.

Test-set overfitting is typically worse than the above, because in most cases you will be considering far more than three strategies. First example: you are haphazardly searching for an edge by trying every kind of strategy you can imagine. Second example (more insidious): you are testing different kernels on an SVM. You will think you have found that one kernel is better suited to the domain of financial forecasting, but it is actually an illusion. Ignore any 'intrinsic' meaning and just conceptualize every such choice as one more entry in an (unfortunately combinatorially large) parameter list.
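
The kernel example can be demonstrated with a sketch like the one below (an illustration, not anything from the original post): fit an SVM with several kernels on data whose labels are pure noise, then "select" the kernel with the best held-out accuracy. Whichever kernel wins will tend to look mildly better than chance even though there is nothing to learn. It assumes scikit-learn is available; the data sizes and kernel list are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = rng.integers(0, 2, size=400)        # labels carry no information about X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Try several kernels and keep the one with the best held-out accuracy.
scores = {k: SVC(kernel=k).fit(X_tr, y_tr).score(X_te, y_te)
          for k in ("linear", "poly", "rbf", "sigmoid")}
best = max(scores, key=scores.get)
print(scores)
print("'best' kernel on pure noise:", best, round(scores[best], 3))
```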

—
This part is just me thinking of ideas and writing. It’s a bit off the deep end: you should stop here unless the top part sounded like old news and was 100% intuitive on the first read-through. —

Hypothetically speaking, if the system had been trained and tested on an infinite amount of data, overfitting would not be a problem (as long as the number of parameters is finite (??)). And I don't mean including all time periods (e.g. take every other period: still infinite, not including all time, and overfitting would still not be a problem). Unless you test on all the data that will ever occur in the future, and not just your out-of-sample set (obviously impossible), you risk fitting the expression of noise that is specific to that set. You will think you have found a pattern in the stock market, when really you have found a pattern in the noise. All finite sets of numbers have patterns, for example the list of all the numbers repeated once. If that is the only pattern, and no sequence repeats more than once, then you will not suffer from too much overfitting even if you follow a flawed procedure like the one described above. The noise only truly becomes noisy once it is infinitely long and there are no more persistent patterns. Until that point it will not be perfect noise, and you must be wary of it.

When you test on anything less than infinite data, you risk selecting the fateful subset of the data that your system happens to predict perfectly. Fortunately, your odds of selecting a highly patterned set from the noise decrease exponentially as you use a larger test set (on the order of 1/k^n). Just remember that the possibility exists, somewhere in the universe, that it was all by chance. [Maybe the laws of physics are false and every human observation so far has simply happened to be perfectly correlated with some perfectly meaningless, unrelated formulas Newton happened upon.]
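
A back-of-the-envelope version of that 1/k^n claim, under the added assumption that each test-set outcome takes one of k equally likely values and the candidate strategies are pure noise: a single meaningless strategy matches all n outcomes with probability k^(-n), and after trying K strategies the chance that at least one looks perfect is 1 - (1 - k^(-n))^K. The helper below just evaluates that formula.

```python
import math

def p_spurious_perfect(k: int, n: int, K: int = 1) -> float:
    """Chance that at least one of K meaningless strategies matches all n test outcomes."""
    # Numerically stable form of 1 - (1 - k**-n)**K.
    return -math.expm1(K * math.log1p(-k ** -float(n)))

print(p_spurious_perfect(k=2, n=10))           # ~0.001 for one strategy and a 10-point test set
print(p_spurious_perfect(k=2, n=10, K=1000))   # ~0.62 after trying 1000 strategies
print(p_spurious_perfect(k=2, n=100, K=1000))  # ~8e-28: a larger test set kills the effect
```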
——

If you can’t recognize all incarnations of overfitting, you will not be able to accurately test a self-adapting system. You can’t even get to the point of looking for an edge of this type because you don’t know how to see.

I would like to see research that goes more in depth on overfitting, beyond what I've mentioned here, so please leave a comment if you know of a source.
