Deconstructing AlphaZero Part 3: How You Gonna Act Like That
The defining feature of an agent is that it makes choices about what actions to take, given its current understanding of the world.
There are many good introductions to the topic of agency, and how agents, including people, choose to act. Some of these are quite popular, like this introduction to the action selection process for reinforcement learning agents from American R&B singer and actor Tyrese, which has almost 300 million views.
(By the way, getting ahead of myself a bit, but once you start building your own agents and you can’t figure out why the dumbass can’t learn you may find that muttering “how you gonna act ... like … that” under your breath from time to time helps soothe the pain… if you know, you know.)
It is cool to see so many people interested in agency. But maybe that shouldn’t be surprising. Who you are is defined by the actions you choose to take. Questions about why we do what we do are meaningful to everyone. Like the great 21st-century philosopher Urban Meyer is quoted as saying, “We are not measured by our intentions, but by our actions.” Worth meditating on that one.
What is the mechanism by which you take the actions you do? Is there some kind of thing inside you that’s ‘the you’ that makes choices? If so where does it live? What’s it made out of? No one knows for sure. But the answer is intimately connected to questions of selfhood, ethics, the meaning of life, spiritual belief, consciousness, first-person experience, and free will.
I find most of the discourse on these topics vague, unsatisfying, and often built on atrocious misconceptions (abusers of quantum mechanics, I’m looking at you).
But building agents that exist in the real world, designed to act on it and change things to their own benefit, is a real tangible thing. Anyone can do it (I'm going to show you how!). It is essentially free (you do need some computing horsepower to make your agents smart, but if you're OK with a dumb agent, even a cheap laptop can do the job). And you can directly test your ideas (and mine!) to see if they are valid, and if so, how or if they intersect with all those deep questions.
Agents either do a thing or they don’t. There is no need for a PhD in philosophy to ask questions about an agent’s free will, or lack thereof. You can personally go through every line of code. There is no room for bad ideas to hide. You don’t need to take my (or anyone’s) word for any of this. You can do it yourself, and see.
With that context, let’s dive in, and reveal how an AlphaZero agent is gonna act… like… that.
First, The AlphaZero Neural Net
When I first introduced AlphaZero I mentioned that there are two kinds of thought processes it uses to make decisions — one ‘slow-thinking’ and deliberative, and the other ‘fast-thinking’ and intuitive. The fast-thinking process uses a neural net to make intuitive guesses for both the policy and the value of the current state.
Before I describe how an AlphaZero agent selects an action, we need to introduce this neural net, as it’s an integral part of that process.
Define a neural net that takes as input any state s and outputs two things: (1) an estimate of the state’s value, which is a single real number V(s) in the range -1 to +1, and (2) a policy vector P(s, a), which is interpreted as the prior probability of selecting any atomic action a from state s. The prior probabilities represent the agent’s initial belief or preference about the likelihood of selecting each possible action from a given state before any additional search or evaluation is performed.
This is why I think of the neural net as generating the agent’s intuition. It is making a snap judgment about the value of the current state and what actions to take from it, without any deliberation or further thought.
Note: usually policies are written with a π, but I’m using P here as the actual policy of the AlphaZero agent is not the same as the policy the neural net suggests.
The ‘your choice’ part of the neural net above depends on what you expect the natural structure of your inputs to be. If your situation is highly visual (e.g. your inputs are game boards where specific local patterns are important, like in Go), convolutional layers make sense. If you don’t know what to put here, you can just slap in a couple of dense layers and everything should work just fine (if it’s worth doing, you can easily change this later to try to improve the neural net’s performance).
While the dive into how to implement all of this stuff will come later, I am using Keras with PyTorch as a backend, and implementing even sophisticated neural nets is a matter of a couple dozen lines of code at most. The community has come a long way over the past decade or so; modern neural nets are probably as easy as they could possibly be to set up and use.
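To make this concrete, here is a minimal sketch of such a two-headed net in Keras. This is my illustration, not DeepMind's architecture: STATE_DIM and NUM_ACTIONS are hypothetical placeholders for your flattened state size and number of atomic actions, and the dense layers are just the 'slap in a couple of dense layers' default mentioned above.

import os
os.environ["KERAS_BACKEND"] = "torch"  # use the PyTorch backend (must be set before importing keras)

import keras
from keras import layers

STATE_DIM = 64     # placeholder: size of your flattened state representation
NUM_ACTIONS = 9    # placeholder: number of atomic actions in your game

inputs = keras.Input(shape=(STATE_DIM,))
x = layers.Dense(128, activation="relu")(inputs)   # the 'your choice' part: swap in conv layers for visual inputs
x = layers.Dense(128, activation="relu")(x)
policy = layers.Dense(NUM_ACTIONS, activation="softmax", name="policy")(x)  # P(s, a)
value = layers.Dense(1, activation="tanh", name="value")(x)                 # V(s) in [-1, +1]
model = keras.Model(inputs=inputs, outputs=[policy, value])

The policy head uses a softmax so P(s, a) is a proper probability distribution over actions, and the value head uses tanh so V(s) lands in the -1 to +1 range.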
Slow, Deliberative, Logical Thinking: MCTS
When an AlphaZero agent decides what action to take from the state it's in, it shouldn't just listen to the neural net's advice (i.e. using P(s, a) to choose an action a from state s), because, at least initially, the neural net is pretty stupid. It needs to learn to make good decisions first! We need to temper our agent's intuitive enthusiasm with some real-world experience so that its intuition gradually gets better.
To do this, the action-taking decision when the agent is playing against itself and learning about its world is going to be a slow, deliberative, logical process, grounded in the actual win/loss results of the agent’s decisions. We will use our impulsive neural net to help though, trying to combine the strengths of each. Here’s how this works.
Each time our agent is asked to select an action from whatever current state s it is in, we do M simulations of possible futures. Choose M to be much larger than the number of moves played in a typical game. The industrial-grade set-ups DeepMind uses set M=800-1,600 for chess and Go; for Tangled, we should choose M >> 2+E, where 2+E is the number of moves to complete a two-player Tangled game on a graph with E edges.
(As an aside, in situations where there isn’t a natural ‘end’ to an episode, I think this process still works, as long as there is actual real reward being generated along the way. M should be set to be much larger than the search depth typically required to provide reward from the environment.)
(As another aside, what I’m describing here requires that we know what state we end up with after we take an action from the current state. I’m pretty sure this whole thing also works if you make this probabilistic (after taking action a from state s you could end up in a bunch of different possible end states with some probability), but in the specific case I’m describing here, actions have a deterministic outcome — taking action a from state s always leads to the same state s’.)
Each simulation starts in the current state s, which we call the root node, and finishes in a leaf node. For MCTS, we will use ‘node’ to refer to a state and a bunch of information relevant to that state.
A leaf node is either a node that has not been visited yet in the current search, or a terminal (end-game) state.
M counts the number of times an as-yet unvisited node is added to the search tree plus the number of times a terminal state is re-visited. To put this number in context, recall that (1) the number of possible states you could potentially visit is unfathomably, astronomically large, and (2) the depth of the search tree is just the number of moves before the episode/game ends, which is usually pretty small. So think of the tree of possible nodes as stupendously wide but not all that deep. Since we only have a tiny M (any number is tiny compared to the number of possible states), we must make good choices about which paths we take through this enormous space.
For each node (labeled by s) visited in the current search tree, we store the following numbers:
N(s): The number of times state s has been visited
N(s, a): The number of times action a has been taken from state s
W(s, a): The total summed value V obtained from every time we’ve taken action a from state s in the current search tree
Q(s, a): The mean value V obtained from taking action a from state s, equal to W(s, a) / N(s, a)
P(s, a): The prior probability of selecting action a from state s; this is the output of the neural net on state s, except for when s is a root node, where we add Dirichlet noise (I’ll explain this later) to the neural net output to encourage exploration from the root node
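As a minimal sketch of how these numbers might be stored per node (my own bookkeeping choice, not AlphaZero's exact data structure; the later code snippets reuse this hypothetical Node record):

from dataclasses import dataclass, field

@dataclass
class Node:
    # P(s, a): prior probability for each legal action a (neural net output, plus noise at the root)
    prior: dict
    # N(s): number of times this state has been visited
    visit_count: int = 0
    # N(s, a): number of times each action has been taken from this state
    edge_visits: dict = field(default_factory=dict)
    # W(s, a): total summed value obtained through each action
    edge_value_sum: dict = field(default_factory=dict)

    def q(self, a):
        # Q(s, a) = W(s, a) / N(s, a), defaulting to 0 for actions not yet tried
        n = self.edge_visits.get(a, 0)
        return self.edge_value_sum.get(a, 0.0) / n if n > 0 else 0.0

Storing W(s, a) and N(s, a) and computing Q(s, a) on demand means there is no running mean to keep in sync after every backup.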
With this terminology and numbers, we are ready to describe the MCTS algorithm. Here it is!
Objective: Given a root state sᵣ, generate a policy π(sᵣ, a) — a list of probabilities of picking action a from state sᵣ
The MCTS Algorithm
Do this M times, each time restarting from the root state sᵣ:
1. Of all the potential legal actions that could be taken from the current state, select one using the Selection Algorithm defined below, and move to the state that action brings you to
2. If you have already visited this new state, go back to step 1 and repeat it using the new state as the current state. If you have not visited it yet, and it's not a terminal state, call it a leaf state: send it into the neural net, and initialize the state's prior probabilities P(s, a) and value V(s) to whatever the neural net outputs, adding Dirichlet noise to P(s, a) if s is a root state. If it is a terminal state, adjudicate it and set the value V(s) to the game outcome for a two-player game (player one wins = +1, draw = +0.01, player two wins = -1); draws get assigned a small positive value rather than zero. Recall that if your situation is a zero-sum game, V(s) => -V(s) for the competing agent (player two's values are the negative of player one's values)
3. Backpropagate the value V(s) of the leaf or terminal state to the states that led to it, being careful with minus signs; increment the number of times each state-action pair N(s, a) was visited along the path from the root node to the leaf or terminal node, and adjust N(s), W(s, a), and Q(s, a) for all the states and actions along this path (a sketch of this backup step follows)
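To make step 3 concrete, here is a minimal sketch of one way to do the backup, reusing the hypothetical Node record from earlier and assuming the convention that every value is expressed from the perspective of the player to move at that state (one common choice; your game may need different sign handling):

# Back up a leaf (or terminal) value along the visited path: a sketch, not AlphaZero's exact code.
# `path` is a list of (node, action) pairs from the root down to the parent of the evaluated state,
# and `leaf_value` is V(s) from the perspective of the player to move at that state.
def backpropagate(path, leaf_value):
    value = leaf_value
    for node, action in reversed(path):
        value = -value  # flip perspective: the player who chose this action is the child's opponent
        node.visit_count += 1
        node.edge_visits[action] = node.edge_visits.get(action, 0) + 1
        node.edge_value_sum[action] = node.edge_value_sum.get(action, 0.0) + value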
After these M simulations are done, we take the quantity N(sᵣ, a) and generate a policy using it and a parameter we will call temperature, which we'll denote 𝜏, in the following way. Define the probability of selecting action a from state sᵣ to be

π(sᵣ, a) = N(sᵣ, a)^(1/𝜏) / Σ_b N(sᵣ, b)^(1/𝜏)

that is, each action's visit count raised to the power 1/𝜏, normalized over all legal actions b so the probabilities sum to one.
If 𝜏=0, the policy selects the action that has been taken most often (the action with the largest number of visits). If 𝜏>0, the probabilities of the other actions go up, and in the limit where 𝜏 is large, the probability distribution flattens so that all actions are equally likely. The addition of this temperature parameter is an additional tunable way to select between exploitation (𝜏=0) and exploration (𝜏>0), at the level of the MCTS policy.
In practice, a strategy that’s often used is that when the system is learning from playing against itself, the temperature is set to something that encourages exploration and diversity (say 𝜏=1) early in the game, and then later on, somewhere in the mid to late-game, it is switched to 𝜏=0 — which makes the move selection the best possible given what the system has learned so far.
When the agent is actually being used, we always set 𝜏=0, as once the system has gotten good, we want to always exploit its best strategies and not do any exploration at this stage.
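Here is a minimal sketch of the temperature step, where visit_counts is a hypothetical array holding N(sᵣ, a) for each legal action:

import numpy as np

def visit_counts_to_policy(visit_counts, tau):
    # Turn root visit counts N(s_r, a) into a policy pi(s_r, a) using temperature tau
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:
        # Greedy: put all the probability mass on the most-visited action
        policy = np.zeros_like(counts)
        policy[np.argmax(counts)] = 1.0
        return policy
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

With tau=1 this just normalizes the visit counts, which is the exploratory behavior you want early in self-play games.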
The Selection Algorithm
Step 1 of the above process is critical because the number of states the algorithm could potentially try is mind-bogglingly enormous, but we are only going to visit a small number of them in our M simulations. We have to be clever in how we select states to include in the tree search.
As is the case in all reinforcement learning algorithms, we need to balance two competing forces. One is exploration — we want to try new things to see if we can find something better than what we already know about. The other is exploitation — if we already have a good strategy, we want to keep doing that, because trying something new is scary… and could be worse! By combining both, we can try to use good strategies we’ve found and also explore enough to have a good chance of finding something better.
People have spent a lot of effort analyzing different strategies for balancing exploration and exploitation, which I won’t revisit here. But to get a flavor for what’s going on, imagine you are in a casino and you have the choice of playing a bunch of slot machines. I tell you in advance that one of them pays off better than the others, but I don’t tell you which. I give you say 100 plays and your job is to make as much money as you can. If you knew the best machine, you’d just play that one 100 times! But you don’t know which is the best. So what do you do? This is called the multi-arm bandit problem. It’s one of the best-analyzed situations in figuring out how to think about the trade-offs between exploration (trying a new machine) and exploitation (sticking to the best machine we know about so far).
An approach to finding a good strategy for this problem is called the Upper Confidence Bound 1, or UCB1, algorithm (this site has a description of the multi-arm bandit problem, some fun interactive graphics, and a good description of two policies we could try for it including UCB1). The UCB1 philosophy is ‘optimism in the face of uncertainty’ — if we are unsure what action to take, take the action that might give us the best payoff given what we’ve seen already. If we chose wrong, this should show up in our results, in which case we can switch to some other choice that is the new potentially best choice.
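For flavor, here is a minimal sketch of UCB1 for the plain bandit setting; means, plays, and total_plays are hypothetical bookkeeping variables for the slot-machine thought experiment, not part of AlphaZero itself:

import math

def ucb1_pick(means, plays, total_plays):
    # Pick the arm with the highest optimistic estimate: mean + sqrt(2 * ln(total plays) / plays on that arm)
    best_arm, best_score = 0, -float("inf")
    for i, (mean, n) in enumerate(zip(means, plays)):
        if n == 0:
            return i  # play every machine at least once
        score = mean + math.sqrt(2.0 * math.log(total_plays) / n)
        if score > best_score:
            best_arm, best_score = i, score
    return best_arm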
In MCTS, we have a variant of this problem, where every node in the game tree is a multi-arm bandit (each possible action from each possible state is a different slot machine we could play)! From the perspective of the current node, this means that the expected payoffs of taking each action shift over time as we learn more about things farther down in the tree. If we use UCB1 for every node in the search tree, we get a modified algorithm called Upper Confidence Trees, or UCT.
The AlphaZero approach modifies UCT by including the prior probabilities P(s, a) computed using the neural net and is called Predictor - Upper Confidence Trees, or PUCT.
OK, that was a lot of words, let's look at what this means in math! Here is the formula for PUCT(s, a):

PUCT(s, a) = Q(s, a) + U(s, a), where U(s, a) = c · P(s, a) · √N(s) / (1 + N(s, a))
The first term Q(s, a) is the average game result across current simulations that took action a from state s. Note that not all paths that took action a from state s end in terminal states in the current search tree, so Q(s, a) depends heavily on the guesswork of the neural net as to the value of non-terminal leaf states.
To see this, note that for the paths in the search tree that don’t end in terminal states, the ‘game result’ from that path is the estimated value V produced by the neural net when that state was first added as a leaf state. When the search tree is first being built, most of the paths after taking action a from state s will end in non-terminal states, so the estimated value of taking action a is constructed mostly of the guesses the neural net is making about the value of those states.
Q(s, a) is a measure of how good taking action a has been so far. If we were to just have this term, and we took the action that maximized it, we would be using a strategy that only exploited and never explored. The second term U(s, a) is an exploration term. It is proportional to the ‘intuitive’ prior probabilities P(s, a) (what the neural network thinks is the right thing to do), a constant c (higher means we explore more, usually in the range ~1-6), and the rightmost term which is largest for actions that have been tried the least (as N(s, a) gets bigger, this term shrinks).
PUCT(s, a) is used in the Selection Algorithm by choosing the action a for which PUCT(s, a) is largest.
Objective: Given a current state s, select the action a* from the set of possible legal actions {a} that maximizes PUCT(s, a)
The Selection Algorithm
1. Compute PUCT(s, a) = Q(s, a) + U(s, a) for the current state s and every legal action a, using the formula above
2. Select and return action a*, where a* = argmax_a PUCT(s, a)
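Putting both steps together, here is a minimal sketch of the Selection Algorithm in code, again reusing the hypothetical Node record from earlier (c is the exploration constant discussed above):

import math

def select_action(node, c=1.5):
    # Choose a* = argmax over a of PUCT(s, a) = Q(s, a) + U(s, a)
    total_visits = sum(node.edge_visits.get(a, 0) for a in node.prior)
    best_action, best_score = None, -float("inf")
    for a, p in node.prior.items():
        n_sa = node.edge_visits.get(a, 0)
        u = c * p * math.sqrt(total_visits) / (1 + n_sa)  # U(s, a): exploration bonus
        score = node.q(a) + u
        if score > best_score:
            best_action, best_score = a, score
    return best_action

Whichever action comes back is the one the simulation follows to the next state.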
Adding Dirichlet Noise to the Root State Prior Probabilities
If the current state s is a root state, the DeepMind folks add a specific type of noise to P(s, a). While for all non-root nodes P(s, a) is the output of the neural net acting on that state, for the root we modify this to be

P(s, a) = (1 - ε) · p(s, a) + ε · Dir(α)
where p(s, a) is the neural net output, and Dir(α) is a sample from the symmetric Dirichlet distribution with parameter α. The reasons for using this particular noise distribution and the setting of the parameter are a bit arcane, but the outcome is that adding the noise encourages exploration of actions that generally wouldn’t be explored based on just the p(s, a) values.
For AlphaGo Zero, ε=0.25, and for AlphaZero, α was scaled in inverse proportion to the approximate number of legal moves in a typical position, to a value of α={0.3, 0.15, 0.03} for chess, shogi, and Go respectively. In the case of the RSG, we have at most 9 legal moves, and about half that from a typical position, so α=0.2 is probably a good choice. Doing this in practice is easy; here's how it works:
import numpy as np

# Combine the original probabilities with Dirichlet noise
p_noisy = (1 - epsilon) * p + epsilon * np.random.dirichlet([alpha] * len(p))
Dirichlet noise is added only during self-play to enhance exploration and improve training. It is omitted during runtime to allow the trained model to perform at its full strength.
Visualizing The Action-Taking Process
I thought it might be helpful to show how everything I just went over works by going step-by-step and describing what’s happening as it happens.
Here’s my attempt to do that — and also the inaugural episode of my shiny new Many Worlds podcast! Subscribe and smash that Like button! Or something!
If you have any questions about any of this, put them in the comments and I’ll try to answer them.