once one of my pups found half a roast chicken in the corner of a parking lot and we had to visit that exact same corner every day for about fifty years because for dogs hope springs eternal when it comes to half a roast chicken – darth
Tic-Tac-Toe is a simple game. If both sides play perfectly, neither can win. But if one plays imperfectly, the other can exploit the flaws in their strategy. Does that sound a little like trading? Reinforcement learning is a machine learning paradigm that can learn behavior to achieve maximum reward in complex dynamic environments, as simple as Tic-Tac-Toe or as complex as Go and options trading. In this post, we will explain what reinforcement learning is, share code to apply it, and point to references to learn more about it. First, we’ll build a simple algorithm to play Tic-Tac-Toe, then learn to trade a non-random price series. Finally, we’ll talk about how reinforcement learning can master complex financial concepts like option pricing and optimal diversification. Let’s dive in!
Properly used, positive reinforcement is extremely powerful. – B. F. Skinner
What Is Reinforcement Learning? A Tic-Tac-Toe Example

Here’s a simple algorithm that will learn an exploitive Tic-Tac-Toe strategy, and adapt over time if its opponent learns:
- Make a big table of all possible Tic-Tac-Toe boards.
- Initialize the table to assign a value of 0 to each board, 1.0 to boards where X has won, and -1.0 to boards where O has won.
- Play with your opponent. At each move, pick the best available move in your table, or if several are tied, pick one at random. Occasionally, make a move at random just to make sure you explore the whole state space, and to keep your opponent on their toes.
- After each game, back up through all the boards that were played.
- When X wins, update each board’s value part of the way to 1.
- When O wins, update part of the way to -1.
- When they tie, update part of the way to 0.
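A minimal Python sketch of this tabular scheme (the board encoding, step size alpha, and helper names are illustrative assumptions):

```python
# values maps a board (tuple of 9 cells) to its estimated value for X.
values = {}

def board_value(board):
    """Unseen boards default to 0; terminal boards would be seeded +/-1 up front."""
    return values.get(board, 0.0)

def update_after_game(boards_played, outcome, alpha=0.1):
    """Back up through the boards of one game, moving each board's value
    part of the way (alpha) toward the final outcome:
    +1.0 when X won, -1.0 when O won, 0.0 for a draw."""
    for board in reversed(boards_played):
        v = board_value(board)
        values[board] = v + alpha * (outcome - v)

# Toy example: three boards from one game that X went on to win.
# Cells: 0 = empty, 1 = X, 2 = O.
game = [
    (0, 0, 0, 0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 1, 0, 0, 0, 0),
    (2, 0, 0, 0, 1, 0, 0, 0, 0),
]
update_after_game(game, outcome=1.0)
```

Each board played moves one-tenth of the way toward the final outcome; over many games the values converge toward the expected result of the game from each position.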
Reinforcement Learning Concepts

But first, let’s dig a little deeper into how reinforcement learning works in general, its components, and variations.
Figure 1. Markov Decision Process (MDP)

This figure and a few more below are from the lectures of David Silver, a leading reinforcement learning researcher known for the AlphaGo project, among others.
- At time t, the agent observes the environment state st (the Tic-Tac-Toe board). 2
- From the set of available actions (the open board squares), the agent takes action at (the best move).
- The environment updates at the next timestep t+1 to a new state st+1. In Tic-Tac-Toe this is the board resulting from the opponent’s move. In a complex environment like a car on a road, the new state may be partly determined by the agent’s actions (you turn left) and partly by visible or hidden complexities in the environment (a dog runs into the road). And the new state may be non-deterministic, things can occur randomly, with probabilities dependent on the visible and hidden state and the actions of the agent.
- The environment generates a reward. In Tic-Tac-Toe you get a reward when you win, lose, or draw. In Space Invaders, you win points at various times when you hit different targets. When training a self-driving car, machine learning engineers design rewards for staying on the road, getting to the destination, including negative rewards for e.g. collisions.
- The MDP is based on the Markov chain model, which has states and probabilities of transitions between states.
- The vanilla Markov chain is extended with actions: at each state the agent can choose an action that influences the transition probabilities. The transition probabilities are a function not just of st but of (st, at).
- Each state transition is associated with a reward (possibly 0).
- Finally, the agent chooses actions based on states, using a policy function π(st) = at.
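These ingredients can be made concrete in a toy sketch (all states, actions, probabilities, and rewards below are made up for illustration):

```python
import random

# A tiny MDP: transition probabilities depend on the (state, action) pair,
# and each transition carries a reward (possibly 0).
transitions = {
    # (state, action): [(next_state, probability, reward), ...]
    ("s0", "stay"): [("s0", 0.9, 0.0), ("s1", 0.1, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 1.0, 0.0)],
    ("s1", "go"):   [("s0", 1.0, -1.0)],
}

def policy(state):
    """A fixed policy pi(s) = a."""
    return "go" if state == "s0" else "stay"

def step(state, action, rng=random.Random(0)):
    """Sample the next state and reward from the transition distribution."""
    outcomes = transitions[(state, action)]
    r = rng.random()
    cum = 0.0
    for next_state, p, reward in outcomes:
        cum += p
        if r <= cum:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][2]

state, total = "s0", 0.0
for t in range(3):
    a = policy(state)
    state, reward = step(state, a)
    total += reward
```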
Policy Function Concepts

The agent’s policy function chooses the best action based on the environment state. It may have the following components:
- Model – An internal representation of the environment. Our Tic-Tac-Toe agent has a model of the board, and it knows some state-action pairs result in the same state as other state-action pairs. A fully model-based algorithm explicitly models the full MDP with all transition probabilities, which our Tic-Tac-Toe algorithm doesn’t do. Other agents may be model-free. They choose actions without explicitly storing an internal model of the state or modeling state transitions. The model of the environment is implicit in the policy function. 3
- State value function approximator – A way to estimate the value of a state (our big table mapping boards to values).
- State-action value function approximator – A way to estimate the value of an action in a given state, i.e. a state-action pair, commonly termed a Q-value function.
Figure 2. Variations of Reinforcement Learning

All reinforcement learning variations learn using a similar workflow:
- Initialize the algorithm with naive, typically random, policy parameters.
- Using the policy, take actions, observe the states before and after each action, and gather experience and rewards.
- Fit a model that improves the policy.
- Go to 2) and iterate, collecting more experience with the improved policy, and continuing to improve it.
Figure 3. Reinforcement Learning Workflow
Reinforcement Learning In Context

In a previous post we discussed the differences between paradigms of machine learning:
- Supervised learning: Any algorithm that predicts labeled data. Regression predicts a continuous response variable: next quarter’s real GDP growth, next month’s stock return (previous discussion here). Classification predicts a categorical response variable: recession or recovery, next month’s return quintile (previous discussion here).
- Unsupervised learning: Any algorithm that summarizes or learns about unlabeled data, such as clustering or dimensionality reduction.
Reinforcement learning combines aspects of both:
- The agent doesn’t have fixed training data. It discovers data via an unsupervised process and learns a policy.
- The rewards can be viewed as labels generated by a supervisor. But rewards aren’t always directly related to one specific prediction or action, or an individual row of data. If the agent shoots a target in Space Invaders, it has to figure out which action or sequence of actions, possibly over many preceding timesteps, contributed to the reward (the credit assignment problem).
- The agent’s interactions with the environment shape that environment, help determine what data the learning algorithm subsequently encounters, and generate a feedback loop. A Space Invaders agent changes the world by shooting targets; a self-driving car doesn’t modify the road, but its actions modify how other vehicles behave, and what environment the algorithm subsequently encounters.
- In supervised learning, the algorithm optimizes model parameters over training data to minimize a loss function, like mean squared error or cross-entropy. In reinforcement learning, the algorithm optimizes model parameters over the state space it encounters, to maximize the expected reward generated by the MDP over time.
Reinforcement learning has relationships with many science and engineering disciplines:
- Business/Operations Research: Dynamic pricing of airline seats or other products to maximize profits under changing inventory, production, and demand conditions.
- Economics: Optimal Fed interest rate policy to maintain full employment and low inflation in a dynamic economy.
- Engineering: Auto-pilots, spacecraft navigation, robots, and industrial automation.
- Psychology: Stimulus-response, positive and negative reinforcement.
- Neuroscience: The brain’s chemical reward loop, how children learn to walk and talk or catch a ball.
- Mathematics: Control theory, game theory, optimization.
Figure 4. Reinforcement Learning’s Relationships with Science and Engineering Disciplines
Figure 5. History Milestones in Reinforcement Learning
Deep Reinforcement Learning

How do we get from our simple Tic-Tac-Toe algorithm to an algorithm that can drive a car or trade a stock? Our table lookup is a linear value function approximator: it takes a board, represents it as a feature vector (with one one-hot feature for each possible board), and outputs a value that is a linear function of that feature vector, the value score for that board. We dot-multiply the one-hot feature vector by the lookup table values and get a linear value function, which we use to choose the next move. If we swap that linear function for a nonlinear function, a neural network, we get our first, very crude, deep reinforcement learning algorithm. Our new deep Q-learning (DQN) algorithm is:
- Initialize our neural network to random weights.
- Play a game with our opponent.
- Append each board we encounter into an n x 9 data array (our predictors) associated with the outcome of the game (our response). Our 9 predictors are the state of each square.
- Fit the neural network to the predictors (vectors representing boards) and responses (win/lose averages) we’ve seen. The neural network predicts the value of each board, instead of a table lookup/linear function.
- Go to 2), gather more data, and continue training to better predict the value of each board.
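A rough sketch of the data-collection and fitting steps above (network size, learning rate, and helper names are assumptions; a production DQN would add refinements like experience replay and target networks):

```python
import numpy as np

# Each board is a length-9 vector: 0 = empty, +1 = X, -1 = O.
X_data, y_data = [], []

def record_game(boards, outcome):
    """Step 3: append each board (predictors) with the game outcome (response)."""
    for b in boards:
        X_data.append(b)
        y_data.append(outcome)

# A tiny one-hidden-layer network stands in for the value approximator.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (9, 16))
W2 = rng.normal(0, 0.1, (16, 1))

def predict(X):
    """Value estimate for each board in X."""
    return np.tanh(np.asarray(X, float) @ W1) @ W2

def fit(X, y, lr=0.05, epochs=500):
    """Step 4: gradient-descent fit of the network to (board, outcome) pairs."""
    global W1, W2
    X = np.asarray(X, float)
    y = np.asarray(y, float).reshape(-1, 1)
    for _ in range(epochs):
        h = np.tanh(X @ W1)
        err = h @ W2 - y                                  # prediction error
        W2 -= lr * h.T @ err / len(X)
        W1 -= lr * X.T @ ((err @ W2.T) * (1 - h ** 2)) / len(X)

record_game([[1, 1, 1, -1, -1, 0, 0, 0, 0]], outcome=1.0)    # X won
record_game([[-1, -1, -1, 1, 1, 0, 0, 0, 0]], outcome=-1.0)  # O won
fit(X_data, y_data)
```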
The same pattern scales up to something like a self-driving car:
- The state is represented by a giant array of inputs from all the onboard cameras and sensors.
- The actions are: turn the steering wheel, accelerate, and brake.
- Positive rewards come from staying on the road and arriving safely at the destination, and negative rewards from breaking traffic laws or colliding.
- The real world provides state transitions.
- And we train a complex neural network to do everything involved in detecting and interpreting all the objects in the environment and navigating from point A to point B.
Reinforcement Learning for Trading: Simple Harmonic Motion

In a trading context, reinforcement learning allows us to use a market signal to create a profitable trading strategy.
- You need a better-than-random prediction to trade profitably. The signal can come from regression, predicting a continuous variable; or classification, predicting a discrete variable such as outperform/underperform (binary classification) or deciles (multinomial classification).
- The reward can be the raw return or risk-adjusted return (Sharpe).
- Reinforcement learning allows you to take a signal and learn a good policy (trading strategy) to maximize the reward (return or risk-adjusted return).
Figure 6. Simulated Stock Price Data (Simple Harmonic Motion)
- Initially, we set the price to 102, the price momentum to 0, and the price ‘anchor’ to 100.
- At each timestep, the price accelerates toward 100 by an amount proportional to the distance from 100. If the price is 102:
- the distance from 100 is 2.
- the new price momentum is old momentum – 2 * k.
- the new price is old price + momentum.
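These dynamics can be sketched in a few lines (the spring constant k is an illustrative assumption):

```python
def simulate_shm(n_steps=500, anchor=100.0, start=102.0, k=0.01):
    """Simple harmonic motion around the anchor: at each timestep the
    price accelerates toward the anchor proportionally to its distance."""
    prices, price, momentum = [], start, 0.0
    for _ in range(n_steps):
        momentum -= k * (price - anchor)  # accelerate toward the anchor
        price += momentum                 # new price = old price + momentum
        prices.append(price)
    return prices

path = simulate_shm()
```

With no noise or damping, the series oscillates around 100 forever, which makes the ideal policy easy to see: go long below the anchor on the way up, short above it on the way down.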
Our policy network has:
- 32 inputs: the last 16 market values (as deviations from 100, the intrinsic value), and the last 16 daily changes. 4
- 2 hidden layers of 16 units.
- 3 outputs of the probabilities of 0, 1, or 2 for short, flat, long respectively (softmax activation, i.e. multinomial classification).
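A plain-NumPy sketch of that architecture (layer sizes from the text; the weight scale and tanh hidden activation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (32, 16))
b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, 16))
b2 = np.zeros(16)
W3 = rng.normal(0, 0.1, (16, 3))
b3 = np.zeros(3)

def policy_probs(state):
    """Map 32 inputs (16 price deviations + 16 daily changes) through two
    hidden layers of 16 units to softmax probabilities for short/flat/long."""
    h1 = np.tanh(state @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    logits = h2 @ W3 + b3
    e = np.exp(logits - logits.max())   # stable softmax
    return e / e.sum()

state = np.zeros(32)        # e.g. price pinned at its anchor, no changes
p = policy_probs(state)
```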
The policy gradient training loop:
- Initialize neural network θ values at random.
- Input the current state to the neural network and generate policy probabilities for short/flat/long.
- Sample from the generated probability distribution and take the sampled action.
- Generate a reward from trading 1 share according to the action taken:
- When our chosen action is 2 (long), the next reward is the change in price at the next timestep.
- When our chosen action is 1 (flat), the next reward is 0.
- When our chosen action is 0 (short), the next reward is the opposite of the change in price at the next timestep.
- Save the observed state, action taken, and reward.
- Compute the gradient vector of each action probability with respect to the policy θ values (the neural network parameters).
- Use the probability gradient to compute the gradient of the expected return over all actions taken with respect to θ (the expected value of the gradient times the return, over all probability-weighted actions taken).
- Update each θ by the gradient of expected return w.r.t. θ, times a learning rate. In other words, update the policy in the direction that increases the average return over all actions taken as much as possible:
- actions with above-average rewards become more probable
- actions with below-average rewards become less probable
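The essence of that update, reduced to a stateless policy with one logit per action (the learning rate and toy rewards are illustrative):

```python
import numpy as np

theta = np.zeros(3)   # one logit per action: short/flat/long

def probs(theta):
    """Softmax over the logits."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reinforce_update(actions, rewards, lr=0.1):
    """Ascend the gradient of expected return. For a softmax policy the
    gradient of log pi(a) w.r.t. the logits is one_hot(a) - probs; weighting
    it by (reward - baseline) makes above-average actions more probable
    and below-average actions less probable."""
    global theta
    baseline = np.mean(rewards)
    for a, r in zip(actions, rewards):
        grad_log = -probs(theta)
        grad_log[a] += 1.0
        theta = theta + lr * (r - baseline) * grad_log

# Toy batch: going long (2) earned +1, going short (0) lost -1.
reinforce_update(actions=[2, 0], rewards=[1.0, -1.0])
p = probs(theta)
```

After the update, long is more probable than short, which is exactly the behavior the two bullets above describe.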
Figure 7. Trading Agent Learning Progress (Simple Harmonic Motion)

Finally, here is one sample episode, color-coded by short/flat/long, with the reward over the course of the episode.
Figure 8. Simulated Stock Trading Results Over One Episode (Simple Harmonic Motion)

It’s not perfect: there are a couple of days where it strays from the ideal policy, but it’s pretty good!
Reinforcement Learning for Trading: Simple Harmonic Motion + Noise + Damping

For a more complex example, we take the simple harmonic motion dynamics and add noise and damping.
Figure 9. Simulated Stock Price Data (Simple Harmonic Motion + Noise + Damping)
Figure 10. Trading Agent Learning Progress (Simple Harmonic Motion + Noise + Damping)
Figure 11. Simulated Stock Trading Results Over One Episode (Simple Harmonic Motion + Noise + Damping)

The Ritter paper uses an Ornstein-Uhlenbeck (OU) process. Like simple harmonic motion, the OU process has stronger mean reversion the farther it is from the mean. But there is no momentum term, so unlike simple harmonic motion, in the absence of noise it will not oscillate periodically but simply revert asymptotically to the mean. Here is an OU process plus noise:
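A minimal simulation sketch of such a mean-reverting process (the parameter values are illustrative, not Ritter's):

```python
import random

def simulate_ou(n_steps=500, mu=100.0, start=102.0,
                speed=0.1, sigma=0.5, seed=0):
    """Discrete-time OU-style series: a pull toward the mean proportional
    to the distance from it (no momentum term), plus Gaussian noise."""
    rng = random.Random(seed)
    prices, price = [], start
    for _ in range(n_steps):
        price += speed * (mu - price) + rng.gauss(0.0, sigma)
        prices.append(price)
    return prices

path = simulate_ou()
```

Setting sigma to 0 shows the key difference from simple harmonic motion: the series decays smoothly toward the mean instead of oscillating through it.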
Figure 12. Simulated Stock Price Data (OU Process, Random Walk + Mean Reversion)
Figure 13. Trading Agent Learning Progress (OU Process, Random Walk + Mean Reversion)
Figure 14. Simulated Stock Trading Results Over One Episode (OU Process, Random Walk + Mean Reversion)

The Ritter paper applies reinforcement learning to a multiple-stock portfolio. This is fairly straightforward from here: change the input to the states of multiple stocks, add outputs for multiple stocks, and compute the reward as a portfolio return. The Ritter paper also uses the Sharpe ratio as the reward, and finds that the algorithm successfully optimizes it, which is a very nice result: the model empirically maximized risk-adjusted reward without knowing anything about the stock market dynamics, a covariance matrix, normality, or even how the reward is computed. But I think this is long enough to illustrate the fundamentals of reinforcement learning, and I’ll stop here. The code for the trading experiments is here.
Advanced technical concepts and RL variations

This table from Wikipedia lists some of the variations of reinforcement learning:
Monte Carlo vs. Temporal Difference and the strange-loopy bootstrap

In our policy gradient algorithm:
- We run an episode.
- We back up from the final timestep to the beginning using observed rewards to compute discounted rewards over the full episode.
- We train by ascending the policy gradient that improves standardized rewards.
That is Monte Carlo learning: we wait until the episode ends and learn from the full observed rewards. Temporal difference (TD) learning instead bootstraps from its own value estimates:
- We use a value function that estimates the expected future reward from this state, following the current policy.
- We run one timestep.
- We back up one step and compute the difference between the value we expected before the action and the value after we took it (the reward from this action, plus the discounted expected value of the new state).
- This improvement is the advantage we got from that action. We train by ascending the θ gradient that improves the probability of the most advantaged actions.
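The one-step TD advantage can be written as a tiny sketch (gamma and the toy values are illustrative):

```python
def td_advantage(value_s, reward, value_s_next, gamma=0.99):
    """Advantage of the action taken: what we actually got (the reward plus
    the discounted value of the state we ended up in) minus what the value
    function expected from the starting state."""
    return reward + gamma * value_s_next - value_s

# Toy numbers: we expected 0.5, got a reward of 1.0 and landed in a
# state worth 0.4, so the action beat expectations.
adv = td_advantage(value_s=0.5, reward=1.0, value_s_next=0.4, gamma=0.99)
```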
Revisiting value-based v. policy-based methods
- If we do TD learning using only a value-network function approximator, and our policy is to choose the action with the best estimated value, this is a Deep Q-Network (DQN).
- If we use only a policy network, this is a Policy Gradient method.
- If we use a value network and a policy network and train the policy function separately so that it improves the value function as much as possible, this is Actor-Critic learning.
The exploration vs. exploitation tradeoff

When we do Q-learning, our policy is to choose the action with the best resulting state-value. However, there is a strong possibility that early in our training one action always looks best in some part of the state space. Since we always choose that action and never try the others, our training never learns from situations where the other actions may be better. To avoid this, we perform ε-greedy exploration: early on, we follow our policy say 60% of the time, and take a random action 40% of the time. This allows the algorithm to experience the whole state/action space. The 40% is the ε parameter. In practice, since our policy network is random at the outset, we typically start with ε at 100% and gradually reduce it to a very small value.
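A sketch of ε-greedy selection with a decaying ε (the decay schedule and floor are illustrative assumptions):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly at random;
    otherwise exploit the action with the best Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Start fully exploratory and decay toward near-pure exploitation.
epsilon = 1.0
for episode in range(1000):
    epsilon = max(0.01, epsilon * 0.995)
    action = epsilon_greedy([0.1, 0.5, 0.2], epsilon)
```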
On-policy vs. off-policy learning

ε-greedy algorithms are ‘off-policy’ learners: they sometimes act differently from the policy and train on the resulting outcomes, in contrast to algorithms that only take ‘on-policy’ actions. Policy gradient algorithms sample actions from a probability distribution: bad actions never have a strictly zero probability, they just become less likely over time. So sampling implicitly trades off exploration of new actions vs. exploitation of the best actions.
TRPO and PPO, or how to avoid falling off a cliff

Finally, we have noticed that sometimes training falls off a cliff. Through the extreme nonlinearity of neural networks, a small gradient update may generate a large change in policy outcomes. Sometimes the new policy is much worse and the optimization has trouble climbing back up, like falling into a crevasse on the way up Mount Everest. One might avoid that with very small steps, i.e. a small learning rate, but then training takes forever. Two popular variations that address this issue are Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). Essentially, they avoid or penalize updates that change the output policy too much in one step. I experimented with a few algorithms in the OpenAI gym, using Cartpole and Lunar Lander here. This is a good way to get started learning about RL, along with the resources listed below.
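The core of PPO's clipped surrogate objective can be sketched in one function (0.2 is a commonly used default clip range; everything else is illustrative):

```python
def ppo_clip_objective(ratio, advantage, epsilon_clip=0.2):
    """ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to
    [1 - eps, 1 + eps] and taking the pessimistic minimum removes any
    incentive to move the policy far from the old one in a single update."""
    clipped = min(max(ratio, 1.0 - epsilon_clip), 1.0 + epsilon_clip)
    return min(ratio * advantage, clipped * advantage)

# A large proposed probability ratio earns no credit beyond the clip range,
# while moves that make a bad action more likely are penalized in full.
capped = ppo_clip_objective(ratio=2.0, advantage=1.0)
penalized = ppo_clip_objective(ratio=0.5, advantage=-1.0)
```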
Takeaways

Like the man in the Molière play who discovers he’s been speaking in prose his whole life, you may have been doing reinforcement learning and optimal control your whole life without realizing it. Reinforcement learning is an important paradigm in machine learning. It can bridge the gap between predicting and doing, between description and prescription. And many markets and economic processes can be impacted by adaptive intelligent agents.

In quant investing, sometimes one finds that a modest predictive R-squared, or a modest change in behavior, avoids the actions with the worst outcomes and leads to a large improvement in returns. At other times, one finds that a significant improvement in R-squared offers no investment performance improvement. Why is that?

Hypothetically, suppose you have a stock that in the long run yields a 5% annualized return, and the single best day each year is up 5%. Suppose you discover a classifier or regression that always gets that one day right and is no better than random the rest of the time. With a fraction of a percent increase in accuracy or R-squared, you almost double your expected return.

Now suppose you can also get a perfect prediction on all the days when the daily return is close to 0. On those days, it doesn’t matter if you are long or not. Under this perverse scenario, you could improve your forecast on every one of those days, and improve your predictive R-squared or classifier accuracy a lot, but it won’t improve your performance. And if you then miss that one good trade, you worsen the investment outcome, possibly even while improving R-squared.

Prediction and control are two different objectives. If you have a perfect prediction, you always choose the correct action, and you have perfect control. But you only need a good enough prediction to select the correct action; beyond that, any further accuracy is superfluous.
And you can sometimes select the correct action at the most critical time, and get pretty good control outcomes, with only a modest predictive signal. 6

In the spirit of deep learning, reinforcement learning trains a complex neural network directly on the reward. Training directly on the reward to choose actions will tend to focus the algorithm on the situations where the choice of action makes the biggest difference. If you train for prediction first, and then use the prediction for control, the training values any improvement in prediction equally, even if it doesn’t improve control.

In deep learning, we don’t necessarily break down a complex engineering problem into more tractable components. Instead, we ask what complex model would be able to solve a complex problem, and train the model end-to-end for a long time on big data. Often, this yields emergent behavior and unreasonably effective models. The link between statistical prediction metrics and real-world performance can be elusive. But in general, reinforcement learning will learn to use whatever forecasting edge it has, to generate a maximum reward in a dynamic environment.

How is reinforcement learning different from backtesting? Backtesting exhaustively searches a parameter space for a parameter vector that obtains the best out-of-sample performance. When we use reinforcement learning with a function approximator and gradient descent, we can use much more complex models, where there are too many parameter combinations to backtest, but still efficiently explore the search space and possibly obtain a better result.

Training complex models end-to-end simultaneously for prediction and control can result in complex emergent behavior that displays almost human-like intelligence, and adaptive behavior. There is a parallel between reinforcement learning and the adaptive market hypothesis of Andrew Lo.
Markets may not be perfectly efficient at all times, but they tend to move in that direction via an evolutionary learning process based on experience. JPMorgan and others have reportedly implemented reinforcement learning to trade in the real world; see for instance this paper and (more readable) article. In another example, Igor Halperin used reinforcement learning to successfully model the return from options trading without any Black-Scholes formula or assumptions about log-normality, slippage, etc.

Thesis:
- Reinforcement learning can learn complex economic decision-making in many cases better than humans.
- Reinforcement learning can develop concepts like how to maximize risk-reward without knowing the CAPM or Black-Scholes.
- Economic and finance theories can be tested empirically in silico by creating multi-agent reinforcement learning experiments where we just tell agents to maximize a reward and see what behaviors they learn.
- Therefore, we don’t need economists to develop fancy models, we don’t need traders to execute trading strategies, and intelligent agents will take over real-world economic decision-making. Agents will move from trading algos to setting prices on Uber and Amazon, to controlling the traffic lights, congestion pricing, and information flow in a transportation network to keep traffic moving.
Antithesis:
- Reinforcement learning is very data-hungry, i.e. sample-inefficient, and so more suited to intraday trading than to lifetime 20-punchcard type problems.
- High model complexity makes interpretability challenging.
- It doesn’t always work. Reinforcement learning can get stuck at local optima or fall off a cliff. You have to take care to not just train on recent experience but also important but rare cases, the way pilots train for equipment failure and upset recovery.
- AI algorithms can be exploited. An adversarial sticker can make image recognition think a banana is a toaster or an adversarial temporary tattoo can defeat facial recognition. Google’s traffic algorithm can be gamed. Trading is an adversarial game against highly adaptive competitors. If a market maker algorithm trades on patterns, adversarial algorithms can learn to manipulate it, paint the tape to make it do bad trades.
Further reading:

Papers and blog posts:
- Csaba Szepesvári, Algorithms for Reinforcement Learning
- Andrej Karpathy: Deep Reinforcement Learning: Pong from Pixels
- OpenAI: Key Papers in Deep RL
- Yuxi Li, Deep Reinforcement Learning
- Thomas G. Fischer, Reinforcement Learning in Financial Markets – A Survey
- Gordon Ritter, Machine Learning for Trading
- Heaton, J.B. and Polson, Nick and Witte, Jan, Deep Learning for Finance: Deep Portfolios
- Kaiqing Zhang, Zhuoran Yang, Tamer Başar, Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms
- Afshin OroojlooyJadid and Davood Hajinezhad, A Review of Cooperative Multi-Agent Deep Reinforcement Learning
- Alex Irpan, Deep Reinforcement Learning Doesn’t Work Yet
- Python implementations of RL algorithms: Lilian Weng; Denny Britz; Arthur Juliani
- Contests: Google Research Football; Pommerman; MineRL
Books:
- Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
- Foundations of Deep Reinforcement Learning: Theory and Practice in Python, Laura Graesser, Wah Loon Keng
- Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville
- Artificial Intelligence: A Modern Approach, Stuart Russell, Peter Norvig
- Even if they are isomorphic to boards it has seen. There are really only 3 starting moves, board center, corner, center side. Flipping or rotating the board shouldn’t change the value of a position or how to play it. ↩
- Or ot, the observable part of the state, in the event the state is not fully observable, and there is some hidden state that determines the evolution of the game, such as face-down cards. ↩
- For instance, instead of a table with all possible Tic-Tac-Toe boards we could use a table mapping (board, action) pairs to values. Then we wouldn’t be modeling internally what happens after a move, i.e. several (board, action) pairs arrive at the same board. We would just evaluate state, action pairs directly without any internal model. That would work pretty much the same, it would just be a bigger table and take longer to train. ↩
- For our simple harmonic motion with no noise, one input of the last change is sufficient. But a more complex model with 16 days of levels and changes can be applied to a more complex example. ↩
- In this simple model, we can use a large discount because each action only influences the next trading day. In a more complex environment where the current action can impact rewards far in the future, we would want to take those rewards into account, and we would use a smaller discount. ↩
- The context here is a price-taking trading algorithm too small to influence the overall market. Of course, an algorithm playing poker or Atari or Google Football tries its utmost to influence the course of the game to its advantage. The most important difference between prediction and control is how the algorithm alters the dynamics of the game, and the resulting feedback loop. A reinforcement learning trading algorithm will tend to reinforce any actions that earn a reward, including influencing the market, painting the tape, or colluding with other market participants. ↩