once one of my pups found half a roast chicken in the corner of a parking lot and we had to visit that exact same corner every day for about fifty years because for dogs hope springs eternal when it comes to half a roast chicken – darth

Properly used, positive reinforcement is extremely powerful. – B. F. Skinner

Tic-Tac-Toe is a simple game. If both sides play perfectly, neither can win. But if one plays imperfectly, the other can exploit the flaws in the other’s strategy. Does that sound a little like trading? Reinforcement learning is a machine learning paradigm that can learn behavior to achieve maximum reward in complex dynamic environments, as simple as Tic-Tac-Toe, or as complex as Go, and options trading. In this post, we will try to explain what reinforcement learning is, share code to apply it, and references to learn more about it. First, we’ll learn a simple algorithm to play Tic-Tac-Toe, then learn to trade a non-random price series. Finally, we’ll talk about how reinforcement learning can master complex financial concepts like option pricing and optimal diversification. Let’s dive in!

## What Is Reinforcement Learning? A Tic-Tac-Toe Example

Here’s a simple algorithm that will learn an exploitive Tic-Tac-Toe strategy, and adapt over time if its opponent learns:
1. Make a big table of all possible Tic-Tac-Toe boards.
2. Initialize the table to assign a value of 0 to each board, 1.0 where X has won, -1.0 where O has won.
3. Play with your opponent. At each move, pick the best available move in your table, or if several are tied, pick one at random. Occasionally, make a move at random just to make sure you explore the whole state space, and to keep your opponent on their toes.
4. After each game, back up through all the boards that were played.
Update the value table as follows:
• When X wins, update each board’s value part of the way to 1.
• When O wins, update part of the way to -1.
• When they tie, update part of the way to 0.
This is a profoundly dumb algorithm in the finest sense. It knows almost nothing about the dynamics of Tic-Tac-Toe, but it works. It can’t reason about the game. It needs a lot of training. It can’t generalize to boards it hasn’t seen. 1 It doesn’t learn a globally optimal strategy, it just tries to find whatever beats its opponent. But over time, this algorithm learns, it exploits flaws in its opponent’s strategy, and if the opponent changes tactics, it adapts. This is reinforcement learning. More sophisticated reinforcement learning algorithms enable robots to walk on four or two legsdriverless cars to drive, computers to play Atari and poker and Go and StarCraft and DOTA and online soccer, in some cases better than humans. Here is some sample Tic-Tac-Toe code. In this post, we’ll extend the Tic-Tac-Toe example to deep reinforcement learning, and build a reinforcement learning trading robot.

## Reinforcement Learning Concepts

But first, let’s dig a little deeper into how reinforcement learning in general works, its components, and variations.

Figure 1. Markov Decision Process (MDP)

This figure and a few more below are from the lectures of David Silver, a leading reinforcement learning researcher known for the AlphaGo project, among others.
1. At time t, the agent observes the environment state st (the Tic-Tac-Toe board). 2
2. From the set of available actions (the open board squares), the agent takes action at (the best move).
3. The environment updates at the next timestep t+1 to a new state st+1. In Tic-Tac-Toe this is the board resulting from the opponent’s move. In a complex environment like a car on a road, the new state may be partly determined by the agent’s actions (you turn left) and partly by visible or hidden complexities in the environment (a dog runs into the road). And the new state may be non-deterministic, things can occur randomly, with probabilities dependent on the visible and hidden state and the actions of the agent.
4. The environment generates a reward. In Tic-Tac-Toe you get a reward when you win, lose, or draw. In Space Invaders, you win points at various times when you hit different targets. When training a self-driving car, machine learning engineers design rewards for staying on the road, getting to the destination, including negative rewards for e.g. collisions.
The technical name for this setting is a Markov Decision Process (MDP).
• It’s based on the Markov chain model, which has states and probabilities of transitions between states.
• The vanilla Markov chain is extended with actions: at each state the agent can choose an action that influences the transition probabilities. The transition probabilities are a function not just of st but of (st, at).
• Each state transition is associated with a reward (possibly 0).
• Finally, the agent chooses actions based on states, using a policy function π(st) = at.
Reinforcement learning always has an environment with states, actions, transitions between states, rewards, and an agent that acts according to policy. The agent proceeds through a cycle of observing the state, acting, getting a reward and repeating forever, or until some terminal state is reached.

## Policy Function Concepts

The agent’s policy function chooses the best action based on the environment state. It may have the following components:
• Model – An internal representation of the environment. Our Tic-Tac-Toe agent has a model of the board, and it knows some state-action pairs result in the same state as other state-action pairs. A fully model-based algorithm explicitly models the full MDP with all transition probabilities, which our Tic-Tac-Toe algorithm doesn’t do. Other agents may be model-free. They choose actions without explicitly storing an internal model of the state or modeling state transitions. The model of the environment is implicit in the policy function. 3
• State value function approximator – A way to estimate the value of a state (our big table mapping boards to values).
• State-action value function approximator – A way to estimate the value of an action in a given state, i.e. a state-action pair, commonly termed a Q-value function.
Just as there are many algorithms for regression or classification, there are many reinforcement learning architectures, and new approaches are constantly emerging. Based on which components a reinforcement learning algorithm uses to generate the workflow illustrated in Figure 1, it can be categorized by type.

Figure 2. Variations of Reinforcement Learning

All reinforcement learning variations learn using a similar workflow:
1. Initialize the algorithm with naive, typically random, policy parameters.
2. Using the policy, take actions, observe states before and after actions, gather experience rewards.
3. Fit a model that improves the policy.
4. Go to 2) and iterate, collecting more experience with the improved policy, and continuing to improve it.
As we continue to iterate, we improve the algorithm.

Figure 3. Reinforcement Learning Workflow

## Reinforcement Learning In Context

In a previous post we discussed the differences between paradigms of machine learning:
• Supervised learning: Any algorithm that predicts labeled data. Regression predicts a continuous response variable: next quarter’s real GDP growth, next month’s stock return (previous discussion here). Classification predicts a categorical response variable: recession or recovery, next month’s return quintile (previous discussion here).
• Unsupervised learning: Any algorithm that summarizes or learns about unlabeled data, such as clustering or dimensionality reduction.
Any data set is either labeled or unlabeled. So between supervised and unsupervised, that must cover all the bases, right? It’s like those two books, What They Teach You At Harvard Business School and What They Don’t Teach You At Harvard Business School. Between the two of them, they must cover the sum of all human knowledge, right? Nevertheless, reinforcement learning is considered the third major machine learning paradigm. Consider the Tic-Tac-Toe robot:
• The agent doesn’t have fixed training data. It discovers data via an unsupervised process and learns a policy.
• The rewards can be viewed as labels generated by a supervisor. But rewards aren’t always directly related to one specific prediction or action, or an individual row of data. If the agent shoots a target in Space Invaders, it has to figure out which action or sequence of actions, possibly over many preceding timesteps, contributed to the reward (the credit assignment problem).
• The agent’s interactions with the environment shape that environment, help determine what data the learning algorithm subsequently encounters, and generate a feedback loop. A Space Invaders agent changes the world by shooting targets; a self-driving car doesn’t modify the road, but its actions modify how other vehicles behave, and what environment the algorithm subsequently encounters.
• In supervised learning, the algorithm optimizes model parameters over training data to minimize a loss function, like mean squared error or cross-entropy. In reinforcement learning, the algorithm optimizes model parameters over the state space it encounters, to maximize the expected reward generated by the MDP over time.
In reinforcement learning, we move beyond prediction to control. Reinforcement learning can be viewed as the application of supervised machine learning to a larger problem of optimal control. We apply supervised prediction methods such as classification and regression to choose the best action to take within the action space and learn behavior policies to maximize reward in a complex dynamic environment. Many disciplines have encountered settings like this and developed methodologies to address them:
• Business/Operations Research: Dynamic pricing of airline seats or other products to maximize profits under changing inventory, production, demand conditions.
• Economics: Optimal Fed interest rate policy to maintain full employment and low inflation in a dynamic economy.
• Engineering: Auto-pilots, spacecraft navigation, robots, and industrial automation.
• Psychology: Stimulus-response, positive and negative reinforcement.
• Neuroscience: The brain’s chemical reward loop, how children learn to walk and talk or catch a ball.
• Mathematics: Control theory, game theory, optimization.

Figure 4. Reinforcement Learning’s Relationships with Science and Engineering Disciplines

Figure 5. History Milestones in Reinforcement Learning

## Deep Reinforcement Learning

How do we get from our simple Tic-Tac-Toe algorithm to an algorithm that can drive a car or trade a stock? Our table lookup is a linear value function approximator. Our linear value function approximator takes a board, represents it as a feature vector (with one one-hot feature for each possible board), and outputs a value that is a linear function of that feature vector, the value score for that board. We dot-multiply the one-hot feature vector by the lookup table values, and we get a linear value function which we use to choose the next move. We can swap that linear function for a nonlinear function, a neural network. When we do that, we get our first, very crude, deep reinforcement learning algorithm. Our new deep Q-learning (DQN) algorithm is:
1. Initialize our neural network to random weights.
2. Play a game with our opponent.
3. Append each board we encounter into an n x 9 data array (our predictors) associated with the outcome of the game (our response). Our 9 predictors are the state of each square.
4. Fit the neural network to the predictors (vectors representing boards) and responses (win/lose averages) we’ve seen. The neural network predicts the value of each board, instead of a table lookup/linear function.
5. Go to 2), gather more data, and continue training to better predict the value of each board.
The more the algorithm plays, the more accurate the function approximator gets, and the better it plays. This algorithm will learn to play, although it takes a long time to train and makes our initial brute force method even more inefficient. (see code). But in a nutshell, that is how a self-driving car could work.
• The state is represented by a giant array of inputs from all the onboard cameras and sensors.
• The actions are: turn the steering wheel, accelerate, and brake.
• Positive rewards come from staying on the road and arriving safely at the destination, and negative rewards from breaking traffic laws or colliding.
• The real world provides state transitions.
• And we train a complex neural network to do everything involved in detecting and interpreting all the objects in the environment and navigating from point A to point B.
Table lookup cannot scale to high dimensional or continuous action and state spaces. And a linear function approximator can’t learn nonlinear behavior. With deep neural networks, reinforcement learning algorithms can learn complex emergent behavior.

## Reinforcement Learning for Trading: Simple Harmonic Motion

In a trading context, reinforcement learning allows us to use a market signal to create a profitable trading strategy.
• You need a better-than-random prediction to trade profitably. The signal can come from regression, predicting a continuous variable; or classification, predicting a discrete variable such as outperform/underperform (binary classification) or deciles (multinomial classification).
• The reward can be the raw return or risk-adjusted return (Sharpe).
• Reinforcement learning allows you to take a signal and learn a good policy (trading strategy) to maximize the reward (return or risk-adjusted return).
Here’s a simple example showing how one can trade using reinforcement learning. This approach is inspired by the paper “Machine Learning For Trading” by Gordon Ritter. We use simple simulated market data as a stepping stone to more complex trading environments. Let’s create a market price time series as a simple sine wave.

Figure 6. Simulated Stock Price Data (Simple Harmonic Motion)

• Initially, we set the price at 102 and price momentum at 0, and 100 as the price ‘anchor’.
• At each timestep, the price accelerates toward 100 by an amount proportional to the distance from 100. If the price is 102:
• the distance from 100 is 2.
• the new price momentum is old momentum – 2 * k.
• the new price is old price + momentum.
This is a simple harmonic motion from physics 101, it describes the oscillation of a spring (Hooke’s law), a pendulum under small oscillations, and many other periodic systems. We can view this as an extremely simplified value/momentum model. 100 is the intrinsic value the stock tends toward. The farther away from intrinsic value, the stronger the acceleration back toward intrinsic value. And momentum means that if the stock is trending up or down, the trend takes time to be reversed. To trade this stock, we use the REINFORCE algorithm, which is a Monte Carlo policy gradient-based method. (We can also use Q-learning, but policy gradient seems to train faster/work better.) We simulate many episodes of 1000 training days, observe the outcomes, and train our policy after each episode. 1) Initialize a neural network to choose actions based on the state.
• 32 inputs: the last 16 market values (as deviations from 100 or intrinsic value), and the last 16 daily changes. 4
• 2 hidden layers of 16 units.
• 3 outputs of the probabilities of 0, 1, or 2 for short, flat, long respectively (softmax activation, i.e. multinomial classification).
• Initialize neural network θ values at random.
2) Generate one episode trajectory using the current policy. At each timestep:
• Input the current state to the neural network and generate policy probabilities for short/flat/long.
• Sample from the generated probability distribution and take the sampled action.
• Generate a reward based on trading 1 share based on the action taken:
• When our chosen action is 2 (long), the next reward is the change in price at the next timestep.
• When our chosen action is 1 (flat), the next reward is 0.
• When our chosen action is 0 (short), the next reward is the opposite of the change in price at the next timestep.
• Save the observed state, action taken, and reward.
3) At the end of the trajectory, back up and compute a discounted future reward observed at each timestep using the action taken and following the current policy to the end of the episode. 5 Standardize the returns (discounted future rewards) by subtracting the mean and dividing by the standard deviation. 4) Update the policy:
• Compute the gradient vector of each action probability with respect to the policy θ values (the neural network parameters)
• Use the probability gradient to compute the gradient of the expected return overall actions taken with respect to θ. (EV of gradient times return overall probability-weighted actions taken)
• Update each θ by the gradient of expected return w.r.t θ, times a learning rate. In other words, update the policy in the direction that increases the average return overall actions taken as much as possible:
• actions with above-average rewards become more probable
• actions with below-average rewards become less probable
5) return to 2) and iterate until the policy stops improving. Here is a chart of total reward as we train over 2000 episodes.

Figure 7. Trading Agent Learning Progress (Simple Harmonic Motion)

Finally, here is one sample episode with color-coding by short/flat/long and reward over the course of the episode.

Figure 8. Simulated Stock Trading Results Over One Episode (Simple Harmonic Motion)

It’s not perfect, there are a couple of days where it strays from the ideal policy, but it’s pretty good!

## Reinforcement Learning for Trading: Simple Harmonic Motion + Noise + Damping

For a more complex example, we take the simple harmonic motion dynamics and add noise + damping.

Figure 9. Simulated Stock Price Data (Simple Harmonic Motion + Noise + Damping)

Figure 10. Trading Agent Learning Progress (Simple Harmonic Motion + Noise + Damping)

Figure 11. Simulated Stock Trading Results Over One Episode (Simple Harmonic Motion + Noise + Damping)

The Ritter paper uses an Ornstein-Uhlenbeck (OU) process. The OU process, like simple harmonic motion, has stronger mean reversion the farther away it is from the mean. But there is no momentum, so unlike simple harmonic motion, in the absence of noise it will not oscillate periodically but just revert asymptotically to the mean. Here is an OU process plus noise:

Figure 12. Simulated Stock Price Data (OU Process, Random Walk + Mean Reversion)

Figure 13. Trading Agent Learning Progress (OU Process, Random Walk + Mean Reversion)

Figure 14. Simulated Stock Trading Results Over One Episode (OU Process, Random Walk + Mean Reversion)

The Ritter paper applies reinforcement learning to a multiple-stock portfolio. This is fairly straightforward from here by changing the input to be the states from multiple stocks, adding multiple outputs for multiple stocks, and computing the reward as a portfolio return. The Ritter paper also uses Sharpe ratio as the reward, and finds that the algorithm successfully optimizes it, which is a very nice result. The model empirically maximized risk-reward without knowing anything about the stock market dynamics, a covariance matrix, normality, or even how the reward is computed. But I think this is long enough and sufficient to illustrate the fundamentals of reinforcement learning, and I’ll stop here. The code for the trading experiments is here.

## Advanced technical concepts and RL variations

This table from Wikipedia lists some of the variations of reinforcement learning:

### Monte Carlo vs. Temporal Difference and the strange-loopy bootstrap

• We run an episode.
• We back up from the final timestep to the beginning using observed rewards to compute discounted rewards over the full episode.
• We train by ascending the policy gradient that improves standardized rewards.
Sometimes we don’t have short episodes, we have a process that continues forever, or for a very long time. A self-driving car algorithm might have trips with millions of timesteps. An alternative to Monte Carlo training, which runs a full episode and backs up to assign rewards, is temporal difference learning (TD):
• We use a value function that estimates the expected future reward from this state, following the current policy.
• We run one timestep.
• We back up one episode and compute the difference between the expected value we saw at the last timestep and the value after we took this action (the reward from this action, plus the discounted current expected value).
• This improvement is the advantage we got from that action. We train by ascending the θ gradient that improves the probability of the most advantaged actions.
This is a slightly strange magical recursive loop. At the outset, our policy has a random θ. So we are training on the improvement from our fairly random value to the slightly less random value at the next timestep where we know one reward. Nevertheless, as we do this many, many times, the influence of rewards further in the future filters back one step at a time and we bootstrap to a better and better policy. When we train on the advantage gained using our policy between now and the next step, it’s called TD(0). We can also train on the improvement 2 steps into the future, and that’s TD(1), and we can do TD(2) and so on. If we do TD(∞) we are continuing through the end of the episode, however long it may be, and we are back to Monte Carlo learning. Finally, there’s a temporal difference algorithm called TD(λ) where we effectively use an exponential moving average of all the TD terms. Setting λ to 0 is effectively TD(0), setting λ to 1 is effectively Monte Carlo, and calibrating λ determines how far into the future we want to peek. With TD(λ), eligibility traces are used to keep track of how much credit to assign to the current action for rewards many steps into the future.

### Revisiting value-based v. policy-based methods

• If we do TD learning using only a state-value neural network function approximator, and our policy is to choose the action resulting in the best state-value, this is a Deep Q-Learning Network (DQN).
• If we use only a policy network, this is a Policy Gradient method.
• If we use a value network and a policy network and train the policy function separately so that it improves the value function as much as possible, this is Actor-Critic learning.

### The exploration vs. exploitation tradeoff.

When we do Q-learning, our policy is to choose the action with the best resulting state-value. However, there is a strong possibility that early in our training one action is always best in some part of the state space. Since we always choose the best action and never try other actions, our training never learns from situations where the other actions may be better. To avoid this, we perform ε-greedy exploration. Early on, we follow our policy say 60% of the time, and a random action 40% of the time. This allows the algorithm to experience the whole state/action space. 40% is the ε parameter. In practice since our policy network is random at the outset, we typically start with ε at 100% and gradually reduce it to a very small value.

### On-policy vs. off-policy learning

ε-greedy algorithms are ‘off-policy’ learning since they sometimes act differently from the policy and train on the resulting outcomes, in contrast to algorithms which only take ‘on-policy’ actions. Policy gradient algorithms sample actions from a probability distribution. Bad actions never have a strictly zero probability, they just get less likely over time. So sampling implicitly trades off an exploration of new actions, vs. exploitation of the best actions.

### TRPO and PPO, or how to avoid falling off a cliff

Finally, we have noticed that sometimes training will fall off a cliff. Through the extreme nonlinearity of neural networks, a small gradient descent update may generate a large change in policy outcomes. Sometimes the policy is much worse and the training optimization has trouble climbing back up, like falling into a crevasse on the way up Mount Everest. One might avoid that with very small steps, i.e. a small learning rate, but then training takes forever. Two popular variations that address this issue are Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). Essentially they avoid or penalize updates that change the output policy too much in one update. I experimented with a few algorithms in the OpenAI gym, using Cartpole and Lunar Lander here. This is a good way to get started learning about RL, along with more resources listed below.

## Takeaways

• Reinforcement learning can learn complex economic decision-making in many cases better than humans.
• Reinforcement learning can develop concepts like how to maximize risk-reward without knowing the CAPM or Black-Scholes.
• Economic and finance theories can be tested empirically in silico by creating multi-agent reinforcement learning experiments where we just tell agents to maximize a reward and see what behaviors they learn.
• Therefore, we don’t need economists to develop fancy models, we don’t need traders to execute trading strategies, and intelligent agents will take over real-world economic decisionmaking. Agents will move from trading algos to setting prices on Uber and Amazon, to controlling the traffic lights, congestion pricing, and information flow in a transportation network to keep traffic moving.
Antithesis:
• Reinforcement learning is very data-hungry or sample-inefficient, more suited to intraday trading, not lifetime 20-punchcard type problems.
• High model complexity makes interpretability challenging.
• It doesn’t always work. Reinforcement learning can get stuck at local optima or fall off a cliff. You have to take care to not just train on recent experience but also important but rare cases, the way pilots train for equipment failure and upset recovery.
• AI algorithms can be exploited. An adversarial sticker can make image recognition think a banana is a toaster or an adversarial temporary tattoo can defeat facial recognition. Google’s traffic algorithm can be gamed. Trading is an adversarial game against highly adaptive competitors. If a market maker algorithm trades on patterns, adversarial algorithms can learn to manipulate it, paint the tape to make it do bad trades.
Synthesis: It’s hard to tell what will happen but it will certainly be interesting. When many agents interact in a market and try to learn an optimal policy at the same time, the environment becomes radically non-stationary. If a self-driving car crashes on an empty track in a single-agent environment, it has to learn which of its actions caused the crash. But in the multi-agent setting, if it crashes into another car, it has to model why the other car did what it did and adust behavior accordingly. With multiple agents interacting, they have to ask themselves not only how much reward to attribute to which past actions they took, but how much credit or blame to attribute to each agent. You get feedback loops and unexpected emergent behavior.In real-world economics or trading settings, maybe agents adapt to each other and arrive at a new, more efficient price equilibrium. Maybe they create extreme algorithm-induced oscillations between multiple equilibria. Maybe they learn to collude? We don’t know. It depends on the market structure, the diversity of the robot population, etc. What we do know is it becomes an NP-hard problem to find a multi-agent policy with optimal outcomes. I’m not sure self-driving vehicles on the streets of New York or New Delhi are likely in the near future, without protected rights of way for self-driving vehicles and strong enforcement. If pedestrians know that the self-driving car is always going to stop for them no matter what, they will learn to just cross at the red light, never mind traffic. They can even wear a stop sign on a T-shirt. It’s not a question of how good the self-driving technology is, it’s a question of game theory. Knowing that the other driver is a fallible fellow human, who may be angry and honk and give you the finger, and at worst may be on a cell phone and not even see you, just leads to different outcomes. Reinforcement learning and intelligent agents may offer economics a path out of the DSGE/microfoundations conundrum. Would you rather have a model that empirically works pretty well in the current real-world regime; or a less accurate model based on a strong theoretical foundation and therefore more robust to regime shifts? By complexifying the representative agent, maybe you get more predictive or at least more interesting micro-founded models. You can ask questions like, what conditions are necessary and sufficient for smart Prisoners Dilemma agents to move from the naive Nash equilibrium where they are better off always defecting, to a meta-Nash equilibrium where every agent has no incentive to deviate from the policy of always cooperating. Economics and finance involve thinking about how systems of interacting, constantly adapting, intelligent agents behave. It seems likely more of them will be non-human, and the whole world will be a lab. “We shape our tools and thereafter our tools shape us” – John Culkin