Deconstructing AlphaZero Part 2: State, Actions, and Reward, and Dreaming of Game Trees

I initially got interested in taking AlphaZero apart because I wanted to see whether it could be used to create agents for playing a new kind of game I’m working on. These games are just like regular games, but with a twist: figuring out who won (equivalently, computing the reward function) might be best done using a quantum computer.

This is still my primary interest. But AlphaZero turned out to be a lot cooler than I’d originally thought. As I started playing with its machinery, I began wondering what situations it might be applied to beyond just playing games. So what I thought I’d do here is try to explain it in a general way, but use Tangled to make the ideas concrete.

Whichever situation you want to apply it to, there are a few things you need to work through first. In this post, I’ll walk you through those steps. All I ask in return is that when you unleash your paperclip maximizer into the world, you give me some warning. Fair?

There are three key things you need to define before you build any reinforcement learning agent. They are the things listed in the diagram in the previous post — state, actions, and reward. Let me reproduce it here so you can stare at its sparse majesty yet again!

Once you have these three things defined, almost everything else is just machinery. But this first step of defining what they are, and in particular how you are going to represent state (this is the hard one), is not always straightforward. There’s no prescriptive way for me to describe how to do it in general.

Instead what I’m going to do is show you how to do it in the specific case of building agents for playing Tangled, but I’ll try to give you some insight into how I think about the more general case.

Step 0: What Is Your Agent’s Overarching Purpose?

In the previous post, I introduced the concept of an agent’s intrinsic goal, which you can think of as its overarching purpose. Before we start getting math-y, we have to spend some time thinking about what we want this to be. It’s critically important to be clear about this as everything follows from it.

Your agent will single-mindedly be trying its best to maximize its cumulative reward over time. This means you have to think about what you want the agent to get reward from. That requires thinking not only about the reward process itself, but also about what the agent needs to know about the world (state) and what it needs to be able to do (actions) to get that reward.

Part of the reason why agents have historically been built for abstract games is that it’s relatively straightforward to figure out what we want that kind of agent to be fixated on: winning the game!

So from now on, we are going to make a choice: we want our agent to excel at winning a specific abstract game (Tangled). But to make this more interactive and fun, I’d suggest that as you follow along, you think about other things you might want an agent to be obsessed with — winning a different game you like, making money trading stocks, driving a car, defending a power plant from an attacking drone swarm, winning a Nobel Prize; I’m sure you can think of a million other things — and think about how you’d approach each of these steps for your idea. If you get stuck and want some help, put your idea in the comments and I’ll see if I can help!

Step 1: Defining State For Your Agent

Now that we’ve decided what our agent is going to try to excel at, the next thing we are going to do is figure out how we are going to define state. Warning: defining state is nearly always the hardest part of the entire process.

Your job as a state designer is to figure out, of all the information you might provide to your agent about the world, what part is relevant to it achieving its intrinsic goal. For example, if we want to build a digital chess-playing agent, clearly we need to be able to provide enough information to represent all the legal chessboard positions. But we don’t need to include things like a list of all the elements of the periodic table or your mom’s maiden name. Those things do exist in the agent’s environment, but they are not relevant to being good at the only thing it cares about, which is winning digital chess games.

Down-selecting state to the subset of information about the world that’s relevant to our agent’s pursuit of its intrinsic goal is an example of an attention mechanism. An attention mechanism is some prescription for ignoring information that isn’t useful to whatever we are doing at the present time. Selecting a state representation is an ‘intelligent design’ attention mechanism, where we decide on the agent’s behalf what’s important and what isn’t.

(You might be wondering whether it might be possible to build agents that learn attention mechanisms and therefore could consume much richer information about the world and have the agent itself learn what is important to achieving its goals. I think yes, that is a very good idea (and one of the core ideas that made transformers and large language models work so well), and would allow the development of interesting and maybe surprising instrumental goals. But at least for what I will show you, that extra step of learning what is relevant in a richer state representation is a story for another day.)

For abstract board games, like chess, Go, hex, checkers, tic-tac-toe, and Tangled, the actual game states themselves can be used as the state representation. This may seem obvious — if I want to define the states of a game, all I need is … the states of the game? But recognize that the fact that this works is pretty much unique to abstract board games. There are very few things in the world where you can ignore everything except some tiny little abstract math pocket universe and get away with it!

I’m going to show you how this works for Tangled. Unfortunately, unless the thing you want your agent to do is play a game well, this method won’t directly apply to other situations (although if your interest is actually in another game, just do something like what I’m doing!). For whatever you are interested in, use your creativity to figure out something that works in your case.

Defining State For Tangled Agents

For building AlphaZero agents, I want to modify the standard Tangled state representation in the following way. For the vertex states, if player 1 (red) owns a vertex, the state value of that vertex will be 1, and if player 2 (blue) owns a vertex, the state value will be -1 (in my previous state representation I called this 2). Unowned vertices are labeled with 0 as before.

We will represent the state of each edge with three bits. If an edge is as yet uncolored, its state is 000. If the edge is colored gray (zero coupling), its state is 100. If the edge is colored green (ferromagnetic coupling), its state is 010. Lastly, if the edge is colored purple (anti-ferromagnetic), its state is 001. This is a ‘one-hot’ encoding of the three possible edge states. This is different from my original state representation, where edges were labeled with integers 0, 1, 2, or 3.

For general graphs with V vertices and E edges, this encoding generates state representations of length V+3*E, which are formed of only +1, -1, and 0 values.

The reason why I modified the state representation in this way is that this state is going to be the input to a neural net in AlphaZero, and I figured it would work better if the inputs were all -1, 0, or +1, instead of what they were before (0, 1, 2, and 3).
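
To make the encoding concrete, here is a minimal sketch in Python of how you might build this state vector. The names (encode_state, EDGE_ONE_HOT, and so on) are just mine for illustration — they don’t come from any official Tangled or AlphaZero code.

```python
# A minimal sketch of the V + 3*E state encoding described above.
# All names here are illustrative, not from any official codebase.

EDGE_ONE_HOT = {
    "uncolored": (0, 0, 0),
    "gray":      (1, 0, 0),  # zero coupling
    "green":     (0, 1, 0),  # ferromagnetic coupling
    "purple":    (0, 0, 1),  # anti-ferromagnetic coupling
}

VERTEX_VALUE = {None: 0, "red": 1, "blue": -1}

def encode_state(vertex_owners, edge_colors, edges):
    """Return the length V + 3*E state vector.

    vertex_owners: list of None / 'red' / 'blue', one entry per vertex
    edge_colors:   dict mapping an edge (i, j) to a color key above
    edges:         list of edges (i, j) in lexical (dictionary) order
    """
    state = [VERTEX_VALUE[owner] for owner in vertex_owners]
    for edge in edges:
        state.extend(EDGE_ONE_HOT[edge_colors.get(edge, "uncolored")])
    return state
```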

A Specific Example: Tangled On The Ridiculously Small Graph

Tangled is pretty cool for a bunch of reasons (unfortunately being fun to play is maybe not one of them). One of the great things about it is that you can play Tangled on any graph you want. The graph you choose becomes the game board. Being able to choose any game board we want allows us to try things on very small game boards to test ideas out, and once we have some confidence things work like we want, we can scale up to arbitrarily large game boards.

In the spirit of trying things out with very small game boards, please meet the RSG (Ridiculously Small Graph). (It’s not really called that, this is just what I call it.) It has only three vertices and two edges. It is the smallest connected graph where Tangled games don’t always end in draws.

Here it is! I’ve labeled the vertices 0, 1, and 2 from left to right. The leftmost edge connecting vertices 0 and 1 is (0,1), and the rightmost edge connecting vertices 1 and 2 is (1,2).

For the RSG, state is a list of V+3*E = 3 + 3*2 = 9 numbers, each of which can only be +1, -1, or 0, where the first (leftmost) three refer to the vertices, and the last six refer to the edges in dictionary order (three bits for edge (0,1) first, then three more bits for edge (1,2) second).

This is an example of how I am writing down state for Tangled, specifically for the RSG (all other graphs are similar, just with more numbers). Take the current game board, which has numbered vertices. For each vertex, write a +1 into state if it’s owned by red, a -1 if it’s owned by blue, and a 0 if it’s unclaimed. Then look at each edge in the graph in lexical order and write down three bits per edge: 0,0,0 if the edge is uncolored, 1,0,0 if the edge is colored gray, 0,1,0 if the edge is colored green, and 0,0,1 if the edge is colored purple. In the above example on the RSG, we start with edge (0,1), and since it’s unselected (that’s what the dashed line is supposed to mean) we write three zeros into state; the last edge (1,2) is purple, so we write 0,0,1 into state. Since the RSG always has three vertices and two edges, its state can always be written down as nine numbers.
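
Using the encode_state sketch from earlier, this example looks like the following. The figure doesn’t say anything about claimed vertices, so I’m assuming all three are still unclaimed:

```python
# Worked RSG example using the encode_state sketch above. Vertex ownership
# isn't specified in the example, so all three vertices are assumed
# unclaimed; edge (0,1) is uncolored and edge (1,2) is purple.

rsg_edges = [(0, 1), (1, 2)]          # edges in lexical order
vertex_owners = [None, None, None]    # no vertex claimed yet
edge_colors = {(1, 2): "purple"}      # (0,1) left uncolored

state = encode_state(vertex_owners, edge_colors, rsg_edges)
print(state)   # -> [0, 0, 0, 0, 0, 0, 0, 0, 1]
```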

Step 2: Define Your Reward Function

Once you have your state representation figured out, the next step is to define your reward function. This requires thinking carefully about how you want to implement the overarching purpose you chose in Step 0.

For games, a natural way to do this is to choose a reward function that gives +1 reward to states corresponding to wins for the agent, -1 reward to states corresponding to losses for the agent, 0 reward for all other non-end-game states, and a small positive reward (say +0.01) for end-game states that end in draws. This is what we will do for Tangled.
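
Here’s a minimal sketch of that reward scheme. The adjudicate() helper is hypothetical — it stands in for whatever procedure (classical or quantum) decides the outcome of a terminal position.

```python
# A minimal sketch of the win/loss/draw reward scheme described above.
# adjudicate(state) is a hypothetical helper that returns 'red', 'blue',
# or 'draw' for terminal states, and None for non-terminal states.

WIN, LOSS, DRAW = 1.0, -1.0, 0.01

def reward(state, player, adjudicate):
    outcome = adjudicate(state)
    if outcome is None:
        return 0.0                      # no reward for non-end-game states
    if outcome == "draw":
        return DRAW                     # small positive reward for draws
    return WIN if outcome == player else LOSS
```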

Note that knowing what the reward is in this case requires determining who won the game, which can be computationally difficult (for Tangled, it is a problem where quantum supremacy has been claimed, so if that claim holds up, a quantum computer should be much faster at computing reward, which as we will see is very important).

Step 3: Define Your Action Set

Once you have your state representation and reward function figured out, the next step is to provide your agent with a set of atomic actions it can perform. How specifically you choose these depends on your situation. For abstract board games, and games in general, actions are just the things a player can do when it’s their move, and it’s usually pretty straightforward to write them down.

For Tangled played on a general graph there are V+3*E atomic actions, corresponding to either coloring a vertex your color (red or blue depending on which player you are) or coloring an edge one of three colors.

For Tangled on the RSG, this gives nine total atomic actions: actions 0, 1, and 2 correspond to selecting vertices 0, 1, and 2 respectively; actions 3, 4, and 5 correspond to coloring the edge (0,1) gray, green, and purple respectively; and actions 6, 7, and 8 correspond to coloring the edge (1,2) gray, green, and purple respectively.

For example, ‘taking action 7’ means ‘coloring edge (1,2) green’.
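
A small sketch of this numbering, with names (decode_action and friends) that are mine for illustration only:

```python
# A sketch of the RSG action numbering described above:
# actions 0..2 claim a vertex, actions 3..8 color an edge.

RSG_EDGES = [(0, 1), (1, 2)]
EDGE_COLOR_ORDER = ["gray", "green", "purple"]

def decode_action(action, num_vertices=3, edges=RSG_EDGES):
    """Translate an action index into a human-readable move."""
    if action < num_vertices:
        return ("claim_vertex", action)
    edge_index, color_index = divmod(action - num_vertices, 3)
    return ("color_edge", edges[edge_index], EDGE_COLOR_ORDER[color_index])

print(decode_action(7))   # -> ('color_edge', (1, 2), 'green')
```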

Step 4 (Optional): Create or Imagine Your Game Tree

AlphaZero assumes an adversarial competition between two agents, where each is trying to beat the other, and by so doing they learn how to get better at whatever they are competing at. While the most obvious way to think about this is two agents playing against each other in a zero-sum game, like chess or Go, I think this framing applies to anything an agent can do. The reason for this is that the quality of an agent can always be measured as a numerical ‘fitness score’ (how much reward it’s been able to accumulate) and therefore two agents doing their thing can be thought of as a two-player game where the scores of the players are just the amount of reward each has been able to accumulate.

Note that I’m not entirely sure about this, because there are aspects of two-player zero-sum games that are important here and that may not be present in other situations. One is the concept of taking turns; another is that, because the game is zero-sum, the value of a state for one agent is the negative of its value for the other; another is that games are finite (they end).

If you have been following along so far with an example you think would be cool to build, I’d like you to change your perspective and think about the following. Imagine there were two people (say the Winklevoss twins because why not) and you wanted them to get good at whatever you care about, and the way they were going to do it was to compete against each other at something. Whatever they were competing at would require them to take turns doing something. As they do this, each of them gets better by learning about how to beat the other one. The score each gets in one game could be the amount of reward they accumulated over one episode, where an episode is some length of time appropriate for your case. How would you set up the competition between the Winklevii?


A Game Tree is a structure that contains all of the possible actions taken from all the possible states from the beginning to the end of an episode. I’m calling it a game tree because in my specific case it is literally a game tree, but like I said even if your case isn’t a game you might still be able to think of it like one! If your case is an actual game, the game tree is usually pretty conceptually simple — players take turns taking actions, and the end of an episode is naturally the end of the game.

For interesting situations, game trees are giga-enormous and far too big to explicitly write down. But just thinking about it in the abstract may help you find issues with your definitions of state, reward, and action set, and I would advise spending at least some time working through what your game tree might look like, if only for one specific episode. (As an aside, some reinforcement learning algorithms, such as tabular Q-learning, do contemplate entire game trees — the complete Q(s,a) table encodes the entire game tree as a table containing all the states (s) and all the actions (a).)
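
To make that aside concrete, here is a toy sketch of what a tabular Q function looks like — just a lookup table over every (state, action) pair, updated with the standard Q-learning backup. The names and parameters are illustrative only, and states need to be hashable (tuples work).

```python
# A toy tabular Q(s, a) table with the standard Q-learning backup.
# States must be hashable (e.g., the tuple form of the state vector).

from collections import defaultdict

Q = defaultdict(float)   # maps (state, action) -> estimated value

def q_update(state, action, reward, next_state, legal_next_actions,
             alpha=0.1, gamma=1.0):
    """One Q-learning backup for a single (state, action) pair."""
    best_next = max((Q[(next_state, a)] for a in legal_next_actions),
                    default=0.0)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```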

Tangled on the RSG generates a very small game tree — we can write down all possible states and all possible actions from those states in a single diagram. This is super useful for what’s to come when we dive into the specifics of how AlphaZero agents work.

OK let’s do it — let’s write down the entire game tree for Tangled on the RSG!

All Tangled games start from the empty board, where every value of the state is zero — the initial state is always [0,0,0, 0,0,0, 0,0,0]: three unclaimed vertices followed by two uncolored edges.

Player 1 (red) goes first. For the first move, all nine possible actions are legal for red, leading to nine possible states after the first move is made. Of these nine states, five are unique (the RSG has a mirror symmetry around the middle vertex). Shown in the following figure are the nine actions and the states they lead to, visualized both as graphs and in their numerical state format, and also (bottom) the unique states and the actions that lead to them in lexical / dictionary order.

The game proceeds with player 2 (blue) taking the next move, red moving third, and finally blue making the fourth and final move (a two-player Tangled game always has exactly 2 + E moves, where E is the number of edges in the game board graph). For the game tree, we need to list all of the possibilities for states at each turn.
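
If you want to build such a tree programmatically, a level-by-level skeleton like the one below works for any turn-based game. The legal_moves() and apply_move() callbacks are hypothetical placeholders for the actual Tangled rules, and this sketch doesn’t merge mirror-symmetric states, so its raw counts will be larger than the unique-state counts shown in the figures here.

```python
# A generic breadth-first sketch for enumerating a game tree level by
# level. legal_moves(state, player) and apply_move(state, player, move)
# are placeholders for the real game rules; states must be hashable.

def enumerate_levels(initial_state, num_moves, legal_moves, apply_move):
    levels = [{initial_state}]
    for move_number in range(num_moves):
        player = "red" if move_number % 2 == 0 else "blue"
        next_level = set()
        for state in levels[-1]:
            for move in legal_moves(state, player):
                next_level.add(apply_move(state, player, move))
        levels.append(next_level)
    return levels   # levels[k] holds every distinct state after k moves

# For Tangled on the RSG, num_moves would be 2 + E = 4.
```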

To complete the game tree, we adjudicate all possible terminal states — there are 27 unique ones giving 7 red wins, 7 blue wins, and 13 draws as shown below. Any possible game is a path starting from the initial state, selecting one of the legal states at each level, and eventually ending in one terminal state.

The entire Tangled game tree for the RSG, including only unique states (mirror-symmetric states not shown) in lexical order (there are 1+5+21+33+27 = 87 unique states). The colored arrows indicate whose move it is, and the winner of each terminal state is shown on the right.
