Deconstructing AlphaZero Part 1: Introduction

AlphaZero is a beautiful algorithm developed by the folks at DeepMind and published in 2017. Its output is a software agent — an AI system that has agency (it makes decisions and takes actions by itself, with no outside intervention).

While originally designed to create game-playing agents, it can potentially be applied in many situations.

Its core idea is to provide an agent with two different decision-making processes. The first is slow, deliberative, logical, and grounded in real experience, and the second is lightning-fast, intuitive, and decisive. With some poetic license, these are ‘thinking slow’ and ‘thinking fast’ processes.

An AlphaZero agent self-improves by competing against itself, using the results of these competitions to improve its intuition about what to do in different situations. As its intuition grows, the agent can rely on it to make better, faster choices.

In the next few posts, I’ll explain how AlphaZero works, and show how to use it to build superhuman agents for playing games.

I’ll use Tangled in my examples, because I’m interested in games where determining the winner is a task for which quantum supremacy has been claimed, but these ideas apply to many other situations as well.

Let’s jump in!

Agents: A Primer

An agent is a system that makes decisions and takes actions. The term agent is starting to be used more and more in everyday discussions of AI, and there is a lot of excitement about them. But they are not new. Agency is a foundational idea in reinforcement learning, which studies how systems learn from trying new things and exploring the world they inhabit.

Here I’m going to introduce the basic concepts you need to understand what agents are and how they work. AlphaZero is a process for building agents, so I figured it would be useful to describe what an agent actually is!

I’ll start by providing a very high-level overview, leaving details for subsequent posts. I’m going to try to make this accessible and not require any specialized knowledge.

Later on when we get into the details it might get a bit harder to follow. But the basic ideas here are understandable if you meditate on them a bit. It’s worth doing this as these types of agents may become ubiquitous over the next few years. You should understand how they work!

Very Important Concepts

First I’m going to show you an image that provides the framing for how to think about what an agent is and does.

This is Figure 3.1 of Rich Sutton and Andrew Barto’s classic Reinforcement Learning book, which is highly recommended if you are interested in a deeper dive into all of this.

This deceptively simple picture, where an agent takes an action that changes its environment, thereby altering the environment’s state and generating reward, is a masterpiece of compression. It reminds me of the foundational equations of physics, like the Schrödinger equation or the Einstein field equations — a simple thing you can just write down that looks unassuming enough, but when unfolded becomes a majestic thing that describes so much of our world.

For our present purposes, we don’t need to get too metaphysical… even though I sometimes can’t help myself. I would, though, like to give you a deeper understanding of what all the words and terms in this diagram mean, so the detailed stuff later on has some context and will be easier to follow.

The Environment

The environment contains everything that is not the agent. One of the subtle things about this picture is the axiomatic separation of the entirety of everything into two disjoint parts — ‘agent’ and ‘not agent’. This categorization is the most fundamental aspect of the theory. It can be a very weird experience to try to draw this boundary (I thought a lot about this at Kindred and Sanctuary, where we were building agents for robots).

For example, where is the boundary that separates the agent that takes your own actions from its environment? Is your physical brain part of the environment, or is it part of the agent? In a robot, it seems clear that the agent somehow lives in software. Is the entirety of the physical robot part of the environment, with the software some kind of ethereal ‘other’ defined by patterns of real things (like electrons flowing through wires) but not exactly a real thing itself? Is the mind-body problem exactly analogous to the software-hardware relationship in computers? If you observe an analog physical system, such as a tree blowing in the wind, and you ascribe change (e.g. branches moving about) to some kind of agency, are the abstract laws of physics themselves the agents ‘deciding’ how to change the environment? If so, where do these laws ‘live’? Are they themselves like software, i.e. not actually residing in the physical world, but some kind of abstract animating spirit?

Also … here’s a mind-bender. In the image above, if you swap the labels on the agent and environment, and re-interpret actions as state and reward, and state and reward as actions, the picture is identical. If you think of yourself as an agent, this means you could just as well view your agency as the environment that the rest of the universe is acting on and changing. In this inverted picture, it’s the rest of the universe that has agency, and what you think of as your own agency is an environment responding to the universe’s actions; the universe observes you by looking at what actions you take, and it gets some kind of reward if your actions are just so.

Wild! … Anyway.

State

State at time t, denoted Sₜ, is a representation of the environment at time t, from the perspective of the agent.

In this picture, time is discrete and comes in ‘ticks’. The difference between two adjacent ticks t and t+1 can be shrunk toward zero, making the theory essentially continuous in time, but in practice that limiting case is rarely used.

State is a representation of the environment. It is not the environment. The environment is usually unknowable, except as a categorical thing (everything that isn’t the agent). State, on the other hand, is some compressed, focused, specific, and concrete representation of the environment that the agent has access to. This is true for all agents, including yourself. Your beliefs about the world and how they are focused and represented (state) are not what is actually true about the world (environment)… but they are related somehow (we still don’t know how, but in some sense this is the mission of theoretical physics).

While state Sₜ is labeled by a time t, it can and usually does contain information about the past. In games like Chess and Go, where the rules require some knowledge of the game’s history, the full sequence of game boards from the beginning of the game to the current board is one possible representation of state.

For the agent that lives inside you, things that happened in the past are very important for deciding what actions you should take. Your memories, even though they happened in the past, are in your current state.

Sometimes, things that happened in the past are not important for an agent, and state can be chosen to just be the current situation. Games like Connect 4 and Tangled have this property — the current game board has all the information the agent needs.

The number of possible states is usually astronomically gigantic for interesting situations. Even for a tiny part of the universe — say a 19x19 Go board — there are something like 2.1 × 10¹⁷⁰ legal game boards, and the number of possible games is unfathomably large.

This is an example of a state in the game of 7x6 Connect 4. If a ‘time tick’ is a move, and the empty board is S₀, this is a specific legal arrangement of the pieces on the game board after 16 ticks (so a possible example of S₁₆). There are 4,531,985,219,092 legal 7x6 Connect 4 states, but each specific one is simple and easy to write down. If our agent is asked to make a move, just knowing the current state is enough to know everything it needs to know!
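
To make this concrete, here is a minimal Python sketch (with purely illustrative names, not code from any real engine) of how an agent might represent and update a 7x6 Connect 4 state:

    import numpy as np

    # 0 = empty cell, 1 = first player's piece, -1 = second player's piece.
    ROWS, COLS = 6, 7

    def empty_board():
        """S_0: the empty board at time tick 0."""
        return np.zeros((ROWS, COLS), dtype=int)

    def drop_piece(board, column, player):
        """Return a new state: the result of `player` dropping a piece in `column`."""
        new_board = board.copy()
        for row in range(ROWS - 1, -1, -1):      # fill from the bottom row up
            if new_board[row, column] == 0:
                new_board[row, column] = player
                return new_board
        raise ValueError("column is full")

    # For Connect 4, the current board alone is a complete state -- no history needed.
    s0 = empty_board()
    s1 = drop_piece(s0, column=3, player=1)      # the state one tick later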

Actions

An action at time t, denoted Aₜ, is what the agent decides to do at time step t.

Aₜ is selected from a finite set of possible ‘atomic actions’ an agent is able to take. This action set is an exhaustive list of all the things the agent can do at any particular time tick.

In chess, at most 218 of these atomic actions are legal from any single position, and they include things like ‘move a pawn forward one square’, ‘move a pawn forward two squares’, or ‘move a bishop three squares diagonally forward and to the left’.

In a computer processor, the action set is called the Instruction Set. The Intel 4004 microprocessor had 46 atomic actions in its instruction set.

In writing English natural language text (or programming code!), the action set is roughly 100 characters: English letters (upper and lower case), digits, and punctuation.

The number of possible atomic actions an agent can take is usually quite small — at most a few hundred, and often much less than that. Even so, an agent can radically change its environment by chaining long sequences of atomic actions. In a processor, for example, a program you’d write in a high-level language like Python is ultimately executed as many billions of sequential processor instructions, and those billions of very simple atomic actions — like changing a memory register from 0 to 1 — can end up having drastic effects on the real world, like launching a nuclear weapon. And a specific sequence of written characters can change the course of history.
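
To give a sense of how small an action set can be, here is what the Connect 4 action set might look like in code, building on the board sketch above (names again illustrative):

    # The full action set for 7x6 Connect 4 is just the 7 columns.
    ACTION_SET = list(range(7))          # action i means "drop a piece in column i"

    def legal_actions(board):
        """The subset of the action set available in this state: columns that aren't full."""
        return [col for col in ACTION_SET if board[0, col] == 0]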

Policy

For any state Sₜ, an agent’s policy, usually denoted π(Sₜ) (π is pronounced pi which starts with the letter p… like policy… ), is a list of probabilities of taking each of its possible atomic actions from state Sₜ.

As an example, an agent could have a policy of randomly selecting an action from its action set for any state. If there were 5 actions it could take, the random policy would be π = [0.2, 0.2, 0.2, 0.2, 0.2] (a 20% chance of choosing each of the five actions, regardless of which state it is in). This random policy would probably not lead to great outcomes, but it’s easy to implement and surprisingly useful for setting a baseline you can compare other policies against.

Most policies depend on the details of the state the agent is in. For example, if the state had an entry for ‘red light’ or ‘green light’ for an autonomous driving agent, the agent’s policy for whether to take the ‘hit the gas’ or ‘hit the brakes’ action should probably depend on that entry.
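
Here is a minimal sketch of two policies over the Connect 4 action set from the earlier sketches: the uniform random baseline, and a toy state-dependent policy that favors the center column:

    import numpy as np

    def random_policy(board):
        """Uniform probabilities over whatever actions are legal in this state."""
        actions = legal_actions(board)
        return {a: 1.0 / len(actions) for a in actions}

    def center_loving_policy(board):
        """A toy state-dependent policy: weight the center column three times more heavily."""
        actions = legal_actions(board)
        weights = np.array([3.0 if a == 3 else 1.0 for a in actions])
        probs = weights / weights.sum()
        return dict(zip(actions, probs))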

Reward

At each time tick, the agent receives a special signal called reward from the environment. Reward Rₜ is the ‘instantaneous feeling’ the agent gets in Sₜ after taking action Aₜ₋₁ from state Sₜ₋₁. In general a reward can be any number, but in the games we’ll consider it is a number between -1 and +1. We can define a reward function R(Sₜ) = Rₜ, which takes in any state and outputs its reward.

Note that even though it looks from the reward function definition that reward comes from state, it doesn’t — reward comes from the environment. It’s just correlated with state. This is important because an agent cannot directly control the environment, and therefore can’t change its reward function. The reward function itself is outside of the agent’s control — the only thing the agent can try to do is find states that have high reward, by acting on the environment and trying to get those high-reward states to show up.
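
As a sketch, here is what a reward function might look like for the Connect 4 example, from player 1’s point of view (the win-detecting helper winner is assumed, not shown):

    def reward(board, player=1):
        """R(S_t): +1 if `player` has won in this state, -1 if they have lost, 0 otherwise.

        Note that the agent doesn't get to write this function -- it belongs to the
        environment. `winner(board)` is an assumed helper returning 1, -1, or None.
        """
        w = winner(board)
        if w is None:
            return 0.0                   # nothing decisive has happened (most time ticks)
        return 1.0 if w == player else -1.0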

Value

An agent may need many, many time ticks to get any feedback from the environment on the effects of its actions. For example, in Chess a typical game lasts about 40 moves per player, and only near the end does it become clear who is going to win, and therefore whether an agent will get the reward from the action leading to a checkmate.

Because of this, it’s usually more useful to think about a different but related attribute of a state, or state-action pair, which is called its value.

The value of a state, denoted Vπ(s), quantifies how "good" it is for an agent to be in that state, considering the expected rewards the agent can accumulate starting from that state and following some policy π.

A related thing, the value of a state-action pair, denoted Qπ(s, a), stores how “good” it is to take action a from state s following policy π. This is the Q in Q-learning, which is an important category of reinforcement learning algorithms (and in some sense the thing that started the modern AI revolution!).
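
For readers who like formulas, the standard Sutton-and-Barto-style definitions are the ones below (in LaTeX notation), where γ is a discount factor, equal to 1 in the undiscounted, episodic games discussed here, and the expectation is over trajectories generated by following policy π:

    V_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s \right]

    Q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a \right]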

To see why value could be a more useful way of thinking about things than reward, imagine you are playing Chess following some specific policy. We write down the reward you get after each of your moves. For all of your moves except the last one, you get zero reward. For the last move, you get either +1 (a win), -1 (a loss), or 0 (a draw).

If you were to play several games from a particular state and won them all, the total accumulated reward for each game would be +1 (adding up a bunch of zeros and eventually a +1 for the checkmate). You would now be pretty sure that if you can get to that state and follow your policy, you will win, so that state has high value, even though the reward you get from taking an action in it is always zero. So value contains information about how the current state, and the actions you can take from it, are likely to help or hurt you in the future, even if they don’t have any immediate effect.
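
That intuition, playing many games from a state and averaging the final rewards, is exactly a Monte Carlo estimate of value. Here is a minimal sketch, assuming the Connect 4 helpers from the earlier sketches plus an assumed game_over helper:

    import numpy as np

    def estimate_value(board, policy, to_move=1, n_games=1000):
        """Monte Carlo estimate of the value of `board` for player 1, under `policy`."""
        total = 0.0
        for _ in range(n_games):
            b, player = board.copy(), to_move
            while not game_over(b):              # assumed helper: True once the game has ended
                probs = policy(b)                # both sides follow the same policy here
                actions = list(probs.keys())
                a = np.random.choice(actions, p=list(probs.values()))
                b = drop_piece(b, a, player)
                player = -player
            total += reward(b, player=1)         # +1 win, -1 loss, 0 draw, for player 1
        return total / n_games                   # lands between -1 and +1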

We can define the value of any state or state-action pair to be a number between -1 and +1. A state or state-action pair has high value if it is likely to lead to high rewards in the future, and low value if it is likely to lead to low or negative rewards in the future.

For games, the highest value states are terminal states where the agent won, and we assign a value of +1 to those states. The worst states are terminal states where the agent lost — those have value of -1.

If a game is two-player zero-sum, the value of a state for one agent is just the negative of the value of the state for the other agent (what’s good for one player is bad for the other in equal measure).

How does it feel…

To be in this state

What An Agent Is Trying To Do

Agents all have the same objective: to maximize their cumulative reward over time. This is their intrinsic goal, which is sometimes also called a final goal or an end goal. It is their overarching purpose: a desired outcome that does not depend on achieving anything else beyond itself.

(For biological organisms, the intrinsic goal seems to be to survive long enough to have children who themselves survive long enough to have children.)

Agents try to achieve their intrinsic goal by altering their policy over time and seeing what happens when different decisions are made in different situations, keeping things that seem to work.

All of the behaviors an agent exhibits while it does this exploration can be thought of as seeking instrumental goals — subordinate or intermediate objectives pursued as a means to achieve the intrinsic goal. Instrumental goals are not valuable in themselves, but they facilitate reaching the intrinsic goal. An example of an instrumental goal for people is saving money.

An agent’s intrinsic goal is defined uniquely by the reward function. Because of this, it is critical when creating an agent — especially one that is powerful enough to affect the real world with its actions — to pay close attention to this step.

Usually, nearly all attention is paid to the question of what reaching the intrinsic goal would mean. For example, if we set up a reward function that provides rewards for killing people, and the system is an embodied robot with access to weapons, we would be purposefully choosing to create an agent whose intrinsic goal is to kill people. We kind of understand this and can choose to do it or not.

But the harder thing to think through is much more subtle, which is what instrumental goals an agent will learn to seek, even in the context of a seemingly harmless intrinsic goal. Even something as innocuous as building a chess-playing agent has potential unforeseeable side effects. If the agent’s action set only contains abstract digital chess moves, there’s probably no issue, as the domain of effect the agent has is constrained to playing chess games in computers, and winning chess games is all it cares about. However, if you add things to its possibilities for actions that affect the real world, you create possibilities for it to find strategies for winning at chess that may be dangerous.

The AlphaZero Algorithm

Now that we know a little bit about what agents are and do, we are almost ready to go over a procedure for building a very interesting and powerful type of agent. Before I show you how to do this, I need to introduce two different kinds of algorithmic decision-making processes (that is, processes designed to be run on computers).

Monte Carlo Tree Search, Also Known As MCTS

Monte Carlo Tree Search is a slow, deliberative, and logical way for a software agent to make decisions about what actions to take. Loosely speaking, you can think of MCTS as a ‘thinking slow’ way for a software agent to decide what action to take.

MCTS is a procedure for systematically exploring a tree of exponentially exploding possibilities for what could happen in the future if an agent takes each of its possible atomic actions. While the algorithm builds on ideas going back to World War II era physics, its modern incarnation is quite recent, with the core ideas introduced in 2006 (check out Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search and Bandit based Monte-Carlo Planning for some early outlines).

MCTS takes as input the current state Sₜ and explores many possible ‘what-if’ scenarios, potentially far into the future. Eventually, after much deliberation, it outputs a policy π(Sₜ) — an educated guess at how much weight to put on each of its possible atomic actions from the current state, based on which of them look most likely to lead to a win (or more generally, to maximal reward).
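
To make the shape of the algorithm concrete, here is a heavily simplified ‘vanilla’ MCTS sketch (UCT selection plus random rollouts) built on the Connect 4 helpers from earlier. This is not AlphaZero’s exact variant, which replaces the random rollout with the neural net described next, but the skeleton (select, expand, simulate, back up) is the same:

    import math
    import random

    class Node:
        """One state in the search tree (a bare-bones sketch, not DeepMind's implementation)."""
        def __init__(self, board, to_move, parent=None, action=None):
            self.board, self.to_move = board, to_move
            self.parent, self.action = parent, action
            self.children = []
            self.untried = legal_actions(board)          # assumed helper from the earlier sketches
            self.visits, self.total_value = 0, 0.0

        def uct_child(self, c=1.4):
            """Pick the child that best balances exploitation and exploration (the UCT rule)."""
            return max(self.children, key=lambda ch: ch.total_value / ch.visits
                       + c * math.sqrt(math.log(self.visits) / ch.visits))

    def mcts_policy(board, to_move=1, n_simulations=2000):
        """Run MCTS from the current (non-terminal) state; return visit-count action probabilities."""
        root = Node(board, to_move)
        for _ in range(n_simulations):
            node = root
            # 1. Selection: walk down the tree via UCT until we reach a node we can still expand.
            while not node.untried and node.children:
                node = node.uct_child()
            # 2. Expansion: add one child for an action we haven't tried yet.
            if node.untried and not game_over(node.board):           # assumed helper
                a = node.untried.pop()
                child = Node(drop_piece(node.board, a, node.to_move),
                             -node.to_move, parent=node, action=a)
                node.children.append(child)
                node = child
            # 3. Simulation ('rollout'): play random moves to the end of the game.
            b, player = node.board.copy(), node.to_move
            while not game_over(b):
                b = drop_piece(b, random.choice(legal_actions(b)), player)
                player = -player
            outcome = reward(b, player=1)                            # +1 / -1 / 0 from player 1's view
            # 4. Backpropagation: credit the outcome to every node on the path we took.
            while node is not None:
                node.visits += 1
                # store value from the perspective of the player who moved INTO this node
                node.total_value += outcome if node.to_move == -1 else -outcome
                node = node.parent
        total_visits = sum(ch.visits for ch in root.children)
        return {ch.action: ch.visits / total_visits for ch in root.children}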

A Neural Network For Estimating Optimal Policy and Value

You have probably heard a lot about neural nets — maybe you even study them yourself! So I won’t talk too much about what they are — there are lots of great resources for learning about their history and how they work.

AlphaZero agents contain a single neural net, which takes as input the current game state Sₜ and outputs two things: (1) a lightning-fast ‘intuitive’ guess at an optimal policy π(Sₜ); and (2) a prediction of the value Vπ(Sₜ) of that state under that policy.

For an AlphaZero agent, this neural net is going to be a ‘thinking fast’ decision-maker.
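
As a sketch of what such a two-headed network might look like, here is a tiny fully-connected version written in PyTorch (the architecture is chosen purely for illustration; the real AlphaZero networks are deep residual convolutional nets):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PolicyValueNet(nn.Module):
        """Minimal two-headed network: one shared body, a policy head, and a value head."""
        def __init__(self, n_inputs=6 * 7, n_actions=7, hidden=128):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(n_inputs, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.policy_head = nn.Linear(hidden, n_actions)   # one logit per atomic action
            self.value_head = nn.Linear(hidden, 1)            # one number: estimated value

        def forward(self, state):
            """state: a batch of flattened boards, shape (batch, n_inputs)."""
            x = self.body(state)
            policy = F.softmax(self.policy_head(x), dim=-1)   # probabilities over actions
            value = torch.tanh(self.value_head(x))            # squashed into [-1, +1]
            return policy, value

    # Usage: fast, 'intuitive' guesses for a single Connect 4 state.
    net = PolicyValueNet()
    board_tensor = torch.zeros(1, 6 * 7)                      # the empty board, flattened
    pi, v = net(board_tensor)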

The AlphaZero Algorithm

With these concepts defined, we are ready to write down a high-level overview of how AlphaZero works.

Here goes!


To build an AlphaZero agent, iterate these three steps until you run out of money or patience:

  1. [the current best agent competes against itself for a while]: Generate a set of completed games, where the best agent so far competes against itself, choosing moves using MCTS augmented with the best current neural net (we’ll go into how this works in detail in the next blog post!)

  2. [use the results of these games to try to build a more effective neural net]: Train a new neural net on a training set augmented with the [state, policy, game result] triples for all states explored in the self-play games from Step 1

  3. [if the new neural net is better, use it]: Pit the current agent using the current neural net against a new agent using the new neural net trained in Step 2 in a set of 1v1 games; if the new agent wins more than some percentage of those games, the new neural net becomes the current best neural net; otherwise keep the current neural net


That’s it! At least at this very high level, it’s surprisingly simple. Doing this works well for building superhuman game-playing agents.
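
In code-shaped form, the loop looks roughly like the sketch below. Every helper here (self_play_game, train, evaluate) is a placeholder for machinery we’ll unpack in later posts, not a real library call:

    import copy

    def alphazero_training_loop(net, n_iterations=100, games_per_iteration=500,
                                win_threshold=0.55):
        """High-level sketch of the three-step AlphaZero loop described above."""
        best_net = net
        for _ in range(n_iterations):
            # Step 1: self-play, with moves chosen by MCTS guided by the current best net.
            examples = []
            for _ in range(games_per_iteration):
                examples += self_play_game(best_net)     # -> a list of [state, policy, result] triples
            # Step 2: train a candidate net on the accumulated self-play data.
            candidate = copy.deepcopy(best_net)
            train(candidate, examples)                   # assumed training routine
            # Step 3: keep the candidate only if it beats the current best often enough.
            if evaluate(candidate, best_net) > win_threshold:
                best_net = candidate
        return best_net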

Figure 1 from Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. The 700,000 training steps represent 44, 24, and 21 million self-played games for Chess, Shogi, and Go respectively, and amount to on the order of a few days of training to go from scratch to superhuman mastery of each game.
