In the last post in our machine learning series, we showed how nonlinear regression algos might improve regression forecasting relative to plain vanilla linear regression (i.e., when underlying reality is nonlinear with complex interactions). In this piece, we’ll first review machine learning for classification, a problem which may be less familiar to investors, but fundamental to machine learning professionals. Next, we’ll apply classification to the classic value/momentum factors (spoiler: the results are pretty good).
- Part 1: An Introduction to Classification Algorithms
- Part 2: Applications in factor investing
Part 1: An Introduction to Classification Algorithms
What Is Classification?
Regression predicts a continuous value: for example, the return on an asset. Classification predicts a discrete value: for example, will a stock outperform next period? This is a binary classification problem, predicting a yes/no response. Another example: Which quartile will a stock’s performance fall into next month? This is multinomial classification, predicting a categorical variable with 4 possible outcomes.
In this post:
- We’ll break down a classification example “Barney-style” with Python code.
- We’ll present a high-level overview of classification algorithms.
- Finally, in part 2, we’ll apply classification to a portfolio to generate an investment strategy by classifying expected returns by quintile.
Classification essentials
We start (code is here) by generating random data with two predictors (the x-axis and y-axis) and a variable with two labels (red or blue). The numbers are generated from a truncated normal with [0,100] bounds (see here for details). To classify the points with logistic regression, we:
- Apply a linear decision function of x and y that outputs a numeric variable: $Z = ax + by + c$
- Apply a sigmoid ‘squashing’ function to Z that maps large positive numbers to a probability estimate close to 1, and large negative numbers to a probability close to 0.
- Apply a loss function that measures prediction error. The loss should be close to 0 when the prediction is close to 1 for the blue observations (label=1); and also close to 0 when the prediction is close to 0 for the red observations (label=0).
- Finally, find the parameters a, b, c of our linear function that minimize the loss function.
The loss function we use is log loss:
$\text{log loss} = -\left[\, y \log(p) + (1-y) \log(1-p) \,\right]$
where y is our label (0 for red, 1 for blue) and p is our predicted probability of label = 1 (blue). A good way to visualize log loss is as -log(correctness). If our prediction is 100% correct, log loss is 0. If our prediction is 0% correct, log loss is +∞.
To break it down further: Our observed label y is either 0 or 1. Our probability prediction p is between 0 and 1, exclusive. 4
If the observed label y is 1 (i.e., blue), and our predicted probability is p (e.g., 0.8), then $\text{log loss} = -\log(p)$. The second term goes away since 1-y=0. For a prediction close to 1, the log loss is close to 0. For a prediction close to 0, the log loss is very large.
Conversely: If the observed label y is 0 (i.e., red), then $\text{log loss} = -\log(1-p)$. The first term goes away since y=0. For a prediction close to 0, the log loss is close to 0. For a prediction close to 1, the log loss is very large.
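To make the two cases concrete, here is a minimal sketch of log loss for a single observation, using only the standard library. The function name `log_loss` and the example probabilities are ours, chosen for illustration.

```python
import math

def log_loss(y, p):
    """Log loss for one observation.

    y: observed label (0 for red, 1 for blue)
    p: predicted probability that the label is 1, strictly between 0 and 1
    """
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Blue observation (y=1) with a fairly confident, correct prediction
loss_good = log_loss(1, 0.8)
# Blue observation (y=1) with a confident, wrong prediction
loss_bad = log_loss(1, 0.01)
# Red observation (y=0) predicted at 0.2 is exactly as "correct" as blue at 0.8
loss_red = log_loss(0, 0.2)

print(round(loss_good, 3), round(loss_bad, 3), round(loss_red, 3))
```

The loss grows without bound as a prediction becomes confidently wrong, which is why p has to stay strictly inside (0, 1).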
- Find a, b, c…
- that minimize the average log loss of the sigmoid function …
- over all training observations.
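The whole recipe (linear function, sigmoid, log loss, minimize) fits in a few lines of NumPy. This is a sketch under our own assumptions, not the code linked above: the two clusters are normal draws clipped to [0, 100] rather than a true truncated normal, the predictors are standardized first, and the minimizer is plain gradient descent with a hand-picked learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: a red cluster (label 0) and a blue cluster (label 1)
red = np.clip(rng.normal(40, 15, size=(n, 2)), 0, 100)
blue = np.clip(rng.normal(60, 15, size=(n, 2)), 0, 100)
X = np.vstack([red, blue])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Standardize the predictors so gradient descent converges quickly
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(params, Xs):
    a, b, c = params
    return sigmoid(a * Xs[:, 0] + b * Xs[:, 1] + c)  # sigmoid(Z), Z = ax + by + c

def avg_log_loss(params, Xs, y):
    p = np.clip(predict_proba(params, Xs), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

params = np.zeros(3)  # a, b, c
initial_loss = avg_log_loss(params, Xs, y)

learning_rate = 0.5  # assumed; not tuned
for _ in range(2000):
    err = predict_proba(params, Xs) - y  # derivative of log loss with respect to Z
    grad = np.array([np.mean(err * Xs[:, 0]), np.mean(err * Xs[:, 1]), np.mean(err)])
    params = params - learning_rate * grad

final_loss = avg_log_loss(params, Xs, y)
accuracy = float(np.mean((predict_proba(params, Xs) > 0.5) == (y == 1)))
print(params, initial_loss, final_loss, accuracy)
```

With all parameters at zero every prediction is 0.5, so the starting loss is log(2); the fitted a and b come out positive because blue points sit up and to the right.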
A Final Point
We predict the blue class when the predicted probability exceeds a probability threshold. In general, we should not assume that our decision boundary must be at the 50% probability line. In a real-world problem, we should consider the cost of false positives vs. false negatives. As we raise the probability threshold, we reduce the number of false positives and increase the number of false negatives. The ROC curve can help us visualize that tradeoff. We want to choose the threshold that maximizes real-world performance: the point where, if we raise the threshold for a positive prediction any further, the marginal gain from fewer false positives just equals the marginal cost of additional false negatives.
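A rough way to see the threshold tradeoff is to sweep thresholds and price each kind of error. In this sketch the scores are simulated, and the costs `cost_fp` and `cost_fn` are made-up numbers standing in for whatever a false positive and a false negative cost in your problem.

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
# Hypothetical model output: noisy but informative probabilities
scores = np.clip(0.5 + 0.25 * (2 * y_true - 1) + rng.normal(0, 0.2, size=2000),
                 0.001, 0.999)

def error_counts(y_true, scores, threshold):
    """Return (false positives, false negatives) when predicting 1 above the threshold."""
    pred = scores > threshold
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    return fp, fn

cost_fp, cost_fn = 1.0, 5.0  # assumed: a miss costs 5x a false alarm
thresholds = np.linspace(0.05, 0.95, 19)
counts = [error_counts(y_true, scores, t) for t in thresholds]
costs = [cost_fp * fp + cost_fn * fn for fp, fn in counts]

best_index = int(np.argmin(costs))
best_threshold = float(thresholds[best_index])
print(best_threshold, counts[best_index])
```

When false negatives are the expensive mistake, the cost-minimizing threshold tends to land below 0.5; flip the costs and it tends to move above.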
An Overview of Classification Algorithms
In our classification code template, we can easily swap out logistic regression for any other sklearn classification algorithm. We can even enumerate all the available classification functions, and run them all on our data via our template:
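Here is a minimal sketch of what swapping classifiers can look like. It is not our actual template: the data come from sklearn's `make_classification`, and the five models in the `classifiers` dict are an arbitrary sample picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-predictor, two-label problem (stand-in for the red/blue data)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(),
    "k nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Every sklearn classifier exposes the same fit/score interface
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[name] = clf.score(X_test, y_test)  # out-of-sample accuracy

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Because the interface is shared, adding another model is a one-line change to the dict.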
Key Questions a Classifier Must Address
- What underlying distribution of y given X are we modeling? (Logistic regression: linear relationship between predictors and log-odds.)
- What is the shape of the decision boundary? Linear, piecewise linear, nonlinear, what broad functional specification? (Logistic regression: linear decision boundary)
- What is the loss function we are minimizing over the underlying distribution and functional specification? Popular loss functions are mean-squared-error (for regression), cross-entropy loss, hinge loss, Huber loss. (Logistic regression: Log loss or cross-entropy)
- What is the mechanism for controlling overfitting and optimizing the bias/variance tradeoff? (Logistic regression: Regularization can be used to apply a penalty to large coefficients and shrink them, biasing the model toward low coefficients.) 8
- Finally, what does the algorithm look like? That is, how do you train it? How does it predict y given X? What is the computational complexity of the algorithm? (E.g., logistic regression is simple and fast; neural networks take a relatively long time to train and tune.) Is the classifier generative or discriminative? A generative classifier models the full joint distribution of X and y; a discriminative classifier creates a less complete model. (Logistic regression just models P(y|X) as opposed to the full joint distribution, so it is a discriminative classifier.) 9
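As a quick illustration of the overfitting control mentioned above, this sketch fits sklearn's `LogisticRegression` at three regularization strengths. In sklearn, `C` is the inverse of the regularization strength and the default penalty is L2; the synthetic data and the particular `C` values are our assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a heavier penalty on large coefficients
    model = LogisticRegression(C=C).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm {norm:.3f}")
```

The heavily penalized fit (C=0.01) has visibly smaller coefficients than the lightly penalized one (C=100), which is the shrinkage described in the regularization bullet.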
Digging in on the Classification Methods
The first 5 classifiers in the classification ensemble picture above produce linear decision boundaries. Using our fake problem above, these classifiers produce essentially identical results. They have minor differences in assumptions about underlying distribution, loss functions, and controls for overfitting.
Other classifiers have smooth nonlinear decision boundaries: Quadratic Discriminant Analysis, Multi-Layer Perceptron, Gaussian Naive Bayes, and our Keras NN (which is similar to MLP). Others are piecewise linear or smooth: KNN, Trees, Bagging, Boosting.
For example, K Nearest Neighbors (KNN) works by finding the k nearest neighbors of any point (e.g., k=5) and assigning the label of the majority of those 5 neighbors. Choose k based on best performance in cross-validation.
Boosting models are currently the state of the art for classification of tabular data 10. They win a lot of Kaggle machine learning contests. They use ensembles of decision trees. 11
Here are some high-level descriptions of various techniques:
Decision trees: Find the rule of the form x > k or y > k that labels the training observations with the lowest average loss. In other words, partition the graph into 2 parts and label one partition 0 and the other 1, using a simple one-variable rule that minimizes the loss function. Next, take each partition, and partition it in the way that minimizes the loss function. Continue until no further loss reduction can be achieved in cross-validation. This tends to yield a patchwork of small squares (and will overfit training data if you don’t cross-validate carefully).
Bagging: Instead of using the single best decision tree for your training data, 1) build many decision trees using a randomly selected half of your data, and 2) have them vote on the prediction for each observation. This is ‘bootstrap aggregation’, hence bagging. It acts like regularization. A single decision tree will produce the best possible partition of your training data. Bagging, by having many classifiers that each see part of the data, will produce a slightly worse result in your training data, but will generalize better out of sample. Many weak rules are better than a single perfect rule, because they are less likely to overfit the training data. We can control overfitting by adjusting the number of trees, the depth of the trees, and the size of the random subsets, and choose the best combination of these hyperparameters using cross-validation.
Random forest: Build trees using both random subsets of data and random subsets of predictors. This prevents the classifier from becoming over-reliant on any particular set of predictors and generally works even better out of sample than bagging.
Boosting: Bagging and random forest are ensemble methods which generate many independent weak classifiers that are run in parallel and then aggregated by e.g. voting. Boosting is an ensemble method where you train many classifiers, but in sequence, at each step training a new classifier to improve prediction on the observations that were previously misclassified.
Ensembles: After running many classifiers and selecting good ones, we can feed their output to the final classification model. In this example, we ensemble the models using a voting classifier and we get about 85.9% accuracy and AUC of 92.4% (a slight improvement vs. the simple logistic regression): 12
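For readers who want to try the bagging and voting ideas, here is a minimal sklearn sketch. It is not the ensemble that produced the numbers above: the data are synthetic, the member models and their settings are our choices, and `max_samples=0.5` is only a loose stand-in for "a randomly selected half of your data" (sklearn's bagging samples with replacement by default).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Bagging: many trees, each fit on a random half-sized sample, then aggregated
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 max_samples=0.5, random_state=0)

# Voting ensemble: average the predicted probabilities of several models
ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression()),
        ("bagging", bagged_trees),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("boosting", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

proba = ensemble.predict_proba(X_test)[:, 1]
ensemble_accuracy = accuracy_score(y_test, ensemble.predict(X_test))
ensemble_auc = roc_auc_score(y_test, proba)
print(f"accuracy {ensemble_accuracy:.3f}, AUC {ensemble_auc:.3f}")
```

Soft voting averages probabilities rather than counting hard votes, which keeps a usable probability for the threshold choice discussed earlier.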
Conclusions Regarding Classification Algos
What have we learned?
- The fundamentals of classification.
- How to perform and interpret logistic regression — a simple classification algorithm.
- Key characteristics of different classification algos.
- Boosting is the state of the art for tabular data.
- How to create an ensemble that combines several classification algorithms for improved results.
Part 2: Factor Investing Applications
In Part 1, we reviewed the fundamentals of classification. In Part 2 we will cover the following:
- Applying classification to a value and momentum model;
- Evaluating classification’s performance in a mean-variance framework;
- Comparing classification to regression.
The Basics of Value and Momentum
Buy Cheap: Value
A value strategy buys stocks that look cheap: a fundamental valuation metric indicates they represent high underlying return potential per dollar invested. We use price/book value (P/B), which is simple and often used in academic research (e.g., Fama and French 1992, 1993). We bucket our universe into P/B quintiles each month and simulate returns of each quintile the following month. All results are from 1973 to 2017 and reflect total returns, which include distributions. 13
Buy Strong: Momentum
We similarly backtest a relative strength momentum strategy: buying the stocks that have been going up the most recently (12-2 momentum). There are plausible reasons stocks may gradually react to good news, which has the potential to create momentum effects.
- Gradual information diffusion: As prospects improve, initially only insiders realize it; then profits increase, but are viewed as possibly transitory; then markets price in durable improvement.
- Under-appreciated second-order effects: Companies that outperform have advantages in hiring, deal-making, media coverage, which enable them to continue outperforming, perpetuating a virtuous cycle as in Soros’s reflexivity thesis.
- Behavioral biases like recency bias; reporting biases like earnings management to show gradually increasing growth.
Cheap and Strong: Value and Momentum
To give ourselves a single baseline for value and momentum, we combine the value and momentum scores into a single score 14, rank the stocks by combined score, and create quintile portfolios, rebalanced monthly.
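A minimal sketch of the combined score for one month of data follows. The inputs are simulated, and two details are our assumptions rather than the method used for the results here: we turn P/B into a "cheapness" score with a negative log, and we standardize within a single cross-section.

```python
import numpy as np

rng = np.random.default_rng(2)
n_stocks = 100
price_to_book = rng.lognormal(mean=0.5, sigma=0.6, size=n_stocks)  # hypothetical P/B
momentum_12_2 = rng.normal(0.08, 0.30, size=n_stocks)              # hypothetical 12-2 returns

def zscore(x):
    return (x - x.mean()) / x.std()

# Lower P/B means cheaper, so flip the sign to make "higher score = more attractive"
value_score = zscore(-np.log(price_to_book))
momentum_score = zscore(momentum_12_2)
combined = value_score + momentum_score

# Rank 0 is the lowest combined score; quintile 4 is the cheap-and-strong bucket
ranks = combined.argsort().argsort()
quintile = ranks * 5 // n_stocks

print(np.bincount(quintile))
```

Standardizing first puts the two signals on the same scale, so neither dominates the sum just because of its units.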
Do Simple Classifications Work?
Well, yes! We just did the most basic naive classification: Put each stock into a quintile bucket according to its trailing value / momentum rank. Invest in e.g. the top quintile, and rebalance monthly.
A slightly more sophisticated framework: Use logistic regression to predict which return quintile bucket each stock will fall into and invest in those with the highest predicted returns. Essentially, we are saying that a 50/50 combination of the value and momentum scores is not necessarily ideal. We want to find a linear combination of the predictors that produces the best classification into quintile buckets.
The predictors are value (P/B), momentum (12-2), and, as another twist, a dummy variable for financials. The impact of P/B varies across industries, and financials may be the most extreme example. The discrete response variable is the quintile into which each stock’s return over the next 3 months will fall. I chose 3-month returns to smooth out some of the noise in monthly returns.
We initially train on 120 months (10 years) and predict the quintile each stock will fall into in the 121st month. Then we walk forward month by month, using all available history to predict the next month. Finally, since we predicted 3-month returns, we simulate investing 1/3 of the portfolio each month in each quintile and holding them for 3 months. We chart performance of each quintile portfolio:
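The walk-forward scheme can be sketched as a small generator of (training months, test month) pairs. The function name `walk_forward_splits` and the `gap` parameter are ours; `gap` is an optional buffer one might use so that 3-month forward labels from the most recent months are not treated as known, and it defaults to 0 because the text above does not say how that was handled.

```python
def walk_forward_splits(n_months, initial_train=120, gap=0):
    """Yield (train_months, test_month) pairs with an expanding training window.

    Months are 0-indexed, so the 121st month is index 120.
    gap: number of most recent months to leave out of training (assumed option).
    """
    for test_month in range(initial_train + gap, n_months):
        train_months = range(0, test_month - gap)
        yield train_months, test_month

splits = list(walk_forward_splits(n_months=124))
first_train, first_test = splits[0]
last_train, last_test = splits[-1]

print(len(first_train), first_test)  # train on 120 months, predict the 121st
print(len(last_train), last_test)    # all available history by the final step
```

Inside the loop you would fit the classifier on the rows belonging to `train_months` and predict quintile probabilities for `test_month`.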
Can We Fix Classification?
To rescue classification, given the classification probabilities above, I computed a weighted average bucket prediction:
0.30 x 0 + 0.09 x 1 + 0.11 x 2 + 0.21 x 3 + 0.29 x 4 = 2.1
This prediction of slightly above the mean is more reasonable than assigning bucket 0, given these probabilities. Then I can bucket stocks into quantiles using this prediction, and get consistent, balanced buckets. 17
Here are the results of using a tweak on the classification buckets:
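The weighted-average arithmetic is easy to check in code. The probabilities are the ones from the example; the variable names are ours.

```python
import numpy as np

# Predicted probabilities for buckets 0..4 for a single stock
probs = np.array([0.30, 0.09, 0.11, 0.21, 0.29])
buckets = np.arange(5)

most_likely_bucket = int(np.argmax(probs))  # what a plain classifier would assign
expected_bucket = float(probs @ buckets)    # probability-weighted average bucket

print(most_likely_bucket, round(expected_bucket, 2))
```

For a whole universe, the same dot product applied to an (n_stocks x 5) probability matrix gives one score per stock, which can then be ranked into equal-sized quantiles.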
Does More Advanced Classification Work?
Maybe we’re on the right track. Can we improve on the baseline using a more advanced classifier? Gradient boosting models are the current state of the art for classification and regression using tabular data. We use XGBoost to predict the same discrete response variable: which quintile will a stock’s next 3-month return fall into. 18
We use the same methodology of predicting quintile probabilities, computing a weighted-average quintile prediction, generating predictions for 1973-2017 (we train on the 10 years prior) by walking forward one month at a time, and finally simulating a portfolio investing in each prediction quintile.
Note: These results are preliminary and should be taken with a huge grain of salt. We’ve got a lot more work to do and need to dig into the weeds (which we’ll do for future posts).
Here is performance by quintile:
Is Classification the Right Answer?
The XGBoost classification result is promising (and almost too good to be true!), but the way we obtained it forced me to address reasons not to use classification for this problem:
- Loss of information when we collapse the return target variable into quintile buckets. This may not be all bad. Using 3 month returns instead of 1-month returns loses information, but much of the lost information may be noise. Bucketing similarly may reduce overfitting to noise, improving results, like regularization.
- Loss of information due to lack of order. 19 As discussed above, log loss only cares about predicting the correct quintile as often as possible, and has no concept of consistent ordering, or getting as close as possible to the correct quintile.
- Portfolio construction can be expressed inherently as a classification problem: you classify stocks when you bucket them into portfolios.
- When we minimized regression mean squared error (MSE), improving MSE didn’t always produce better portfolio buckets (partly because we had low R-squared), as we found in a previous post.
- I wanted to write about a classification example and understand/explain it in a portfolio selection context.
Why Not Just Use Regression?
Finally, let’s compare XGBoost classification to XGBoost regression. In contrast to XGBoost classification, which predicted discrete quintile buckets, we predict the 3-month returns as a continuous variable, minimizing mean squared error. Then, we bucket the forecast returns into quintiles and simulate investing in each quintile.
Conclusions on Classification
What have we learned?
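The regression route can be sketched in the same way. To keep this runnable with sklearn alone, we use `GradientBoostingRegressor` as a stand-in for XGBoost; the simulated predictors, the made-up return process, and the hyperparameters are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
# Hypothetical standardized predictors (think value and momentum scores)
X_train = rng.normal(size=(2000, 2))
# Hypothetical 3-month returns: weak signal plus a lot of noise
y_train = (0.02 * X_train[:, 0] + 0.01 * X_train[:, 1] ** 2
           + rng.normal(0, 0.1, size=2000))
X_next = rng.normal(size=(500, 2))  # the stocks we want to rank next period

# Default loss is squared error, i.e. the model minimizes MSE
model = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                  learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)
forecast = model.predict(X_next)

# Bucket the continuous forecasts into equal-sized quintiles (4 = highest forecast)
ranks = forecast.argsort().argsort()
quintile = ranks * 5 // len(forecast)

print(np.bincount(quintile))
```

Unlike the classifier, the regression forecast is already an ordered score, so the quintiles come out balanced and consistently ordered without any extra step.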
- Simple value and momentum strategies work well.
- Vanilla classification has problems in a portfolio selection context, and doesn’t improve on a simple average of value and momentum scores.
- Gradient boosting classification improves significantly on simple strategies.
- Regression may be a better starting point for this problem, and gradient boosting regression improves significantly on gradient boosting classification.
- Theoretical models should be easily understandable, and inherently over-simplify. Highly optimized empirical models are hard to interpret, and we need to be very careful about overfitting limited data.
- Machine learning models in general, and especially gradient boosting models, are on the black-boxy end of the spectrum.
- We should apply a reasonable amount of optimization for the real-world problem we are trying to solve. The nature of the beast is that gradient boosting probably selected rules that were maybe a little lucky to work really well in the past, and other people have probably found similar patterns and are trying to exploit them going forward. So we shouldn’t expect this to work as well in the future.
- Test additional factors: EPS, FCF, PEG, or multi-metric value models instead of P/B; quality, size.
- Industry (or large scale sectors like financial/nonfinancial/tech where value may be measured differently). A good historical industry taxonomy for this purpose is important and the 48 Fama/French industries aren’t ideal.
- Smoothing methods like a denoising auto-encoder; other ways to optimize turnover.
- Ordinal regression methods.
- Portfolio optimization using regression forecasts as input.
- Gain a better understanding of the nonlinear relationships XGBoost is finding, and use that information intelligently to come up with simple rules which may be faster and more robust, for instance using nonlinear thresholds or polynomial features (squaring and cubing the predictors).
- The distance from the line defined by $ax + by + c = 0$ to a point $(x_0, y_0)$ is $|ax_0 + by_0 + c| / \sqrt{a^2 + b^2}$. Since $\sqrt{a^2 + b^2}$ is constant with respect to x and y, $Z = ax + by + c$ has a linear relationship to the classification margin.
The red/blue shadings define a plane if you visualize color intensity in the z-axis. The blue areas are positive, the red areas are negative, with z = 0 at the decision boundary. (Although our colorings are based on logistic_function(Z), not Z itself.)
- To match Vegas odds, we have to take the inverse of the way they are normally quoted, or take but you get the idea.
- This is not the only possible ‘squashing’ function. We can use any function that maps the interval [-∞, +∞] to the interval [0,1]. We could use sigmoid(straight odds) instead of sigmoid(log-odds). We could use the normal cumulative distribution function (CDF) which maps the z-score to the area under the normal distribution up to that z-score — our familiar statistical significance tables. If you take the derivative of the normal CDF function, you get the probability density of the normal distribution. If you take the derivative of the logistic function, you get the probability density of the logistic distribution: the logistic function is the CDF of the logistic distribution. If we use the normal CDF as our sigmoid function, we are doing a probit model. If we use our logistic sigmoid, we are doing a logit model. The choice of sigmoid reflects the distribution you are modeling. In machine learning, we generally use the logit model. If you have a good reason to believe you are modeling a normal distribution, and the distance from the decision boundary has a linear relationship to the z-score, then the probit model is technically correct (the best kind of correct). If you don’t know your distribution is normal but have good reason to think log odds can be modeled as a linear function of predictors, the logit model is correct. The logit model is computationally simple to optimize, the log-odds relationship is a reasonable assumption even when the modeled distribution is not normal, and in practice logit and probit sigmoids are very similar:
- Exclusive since our sigmoid is never exactly 0 or 1 except as a limit when the decision function goes to -∞ or +∞.
- Editor’s note: If readers are more familiar with Excel, here is an example of how this process works via Solver.
CV Threshold 0.4236, Accuracy 0.8475, F1 0.8562
Out-Of-Sample Accuracy: 0.8442
Confusion Matrix:
[[1581  419]
 [ 204 1796]]
(True negative 1581, True positive 1796, False positive 204, False negative 419)
AUC: 0.9157
- You might even say we are classifying the classifiers. (YEEAAAAH!)
- L2 regularization will minimize the sum of the log loss and the squared coefficients multiplied by some constant λ. It will tend to shrink the coefficients towards 0 and reduce the impact of outliers. It will also tend to equalize the magnitude of the coefficients, biasing the model toward a 45° line: since the derivative of x² is 2x, shrinking a much larger coefficient will have much more effect on the overall loss. L2 regularization corresponds to a Gaussian prior distribution on the coefficients centered at 0. L1 regularization corresponds to a Laplacian prior on the coefficients centered at 0.
- If it modeled P(X) and P(y|X), which are sufficient to fully specify P(y) and P(X|y), then it would be a generative classifier.
- Tabular data as opposed to classification of images, audio, complex problems that benefit from deep learning.
- Technically, you can apply boosting to any differentiable classifier. But the tree models are the most popular and successful.
In-Sample Accuracy: 0.8535
Out-Of-Sample Accuracy: 0.8592
Confusion Matrix:
[[1635  365]
 [ 198 1802]]
(True negative 1635, True positive 1802, False positive 198, False negative 365)
AUC: 0.9237
- Alpha Architect and other researchers have found different metrics may be better: P/FCF, EV/EBITDA, other engineered metrics or multiple-metric approaches.
- We standardize the entire series so value and momentum have the same mean and standard deviation, and add the two scores. (It may be more technically correct to standardize each month independently.)
- Also called binary cross-entropy.
- Not just a contrived question since a high-beta distressed stock has a lot of potential to move from the bottom to the top momentum quantile.
- But this generates a continuous value instead of a discrete value, effectively transforming classification back into regression, and leading me to question my life choices.
- In addition to XGBoost, Python scikit-learn has GradientBoostingClassifier and GradientBoostingRegressor, Microsoft has LightGBM and Yandex has CatBoost. The last two, like XGBoost, are very fast and highly tunable. XGBoost has been around the longest and, if no longer the undisputed champion, is holding its own against the upstarts.
- If we converted our quintile classification to a binomial 2-class outperform/underperform problem, the ordering issue goes away, since 2 classes are always in some consistent order. But a tool that always invests in 50% of the market is a blunt one.
- There are ordinal regression models which constrain the output probabilities to monotonically decrease as you get farther from the most probable class. There are ranking models, and other loss functions, which may be worth considering. As the hipsters say, they’re kind of niche so you might not have heard of them. But these concepts seem well-suited to this problem. That might be a future mad science experiment. An alternative approach is to use a regression score as an input to an optimization, where we find the highest scoring portfolio subject to a maximum volatility or tracking error constraint. It gets us away from arbitrary quintile buckets and may use the regression score information more intelligently.