Deconstructing AlphaZero Part 4: How An Agent Learns

In the last post, I described how an AlphaZero agent decides to take an action. Being able to take an action from any state — even if the actions are random — is enough for the agent to play against itself, as a fully played game is just a series of actions.

Different action choices lead to different outcomes, and an agent is very much invested in certain outcomes over others! In the introductory post in this series, I wrote:

Agents all have the same objective: to maximize their cumulative reward over time. This is their intrinsic goal, which is sometimes also called a final goal or an end goal. It is their overarching purpose. It is their desired outcome that does not depend on achieving anything else beyond itself.

Here I want to show you how an AlphaZero agent learns about its world. As it learns, its ability to intuit what to do grows, and it gets better at taking actions that help in achieving its intrinsic goal.

The AlphaZero Neural Net

A big part of the AlphaZero decision-making process comes from the agent’s neural net, which guesses a policy (which actions look promising from the current state) and a value (how good the current state is). I think of these outputs as the agent’s intuition about what action to take and how well it is doing. For AlphaZero agents, learning to take better actions (or, said another way, becoming more intelligent) happens via improving this intuition.
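To make that concrete, here is a minimal sketch of what a two-headed policy/value network can look like. It assumes PyTorch and a flattened state vector; the class name PolicyValueNet, the layer sizes, and the dimensions are illustrative assumptions, not the implementation discussed later in this post.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Toy two-headed net: maps a flattened state to (policy logits, value)."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # logits over legal-and-illegal actions
        self.value_head = nn.Linear(hidden, 1)             # scalar estimate of how good the state is

    def forward(self, state: torch.Tensor):
        h = self.body(state)
        policy_logits = self.policy_head(h)        # the "intuition" about which action to take
        value = torch.tanh(self.value_head(h))     # the "intuition" about the state, squashed to [-1, 1]
        return policy_logits, value.squeeze(-1)
```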

At first, the neural net just outputs random opinions. It doesn’t know anything about its world and suggests stupid things. For it to be useful to the agent, it needs to learn from experience which actions to take or avoid in whatever state it finds itself.

Technically, this works by updating the agent’s neural net’s weights based on the results of the agent playing against itself. This is a supervised learning step: each training example has an input of a state s, and targets of a policy π(s, a) and a value V(s). These training examples come from the previous self-play step and are ultimately grounded in the results of actual games.
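As a rough sketch of that supervised step: AlphaZero-style training combines a cross-entropy term for the policy target with a mean-squared-error term for the value target. The code below assumes the two-headed PyTorch net sketched above and hypothetical tensor shapes; it is a simplified illustration, not the exact training loop from my implementation.

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, states, target_policies, target_values):
    """One supervised update on a batch of self-play examples.

    states:          (B, state_dim)   float tensor of encoded states
    target_policies: (B, num_actions) search-improved policy distributions
    target_values:   (B,)             game outcomes from the player-to-move's perspective
    """
    policy_logits, values = net(states)

    # Cross-entropy between the search-improved policy target and the net's policy head.
    policy_loss = -(target_policies * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # Mean-squared error between the predicted value and the actual game result.
    value_loss = F.mse_loss(values, target_values)

    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```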

Over time and many cycles of learning, something like an implicit model of the agent’s world gets slowly baked into the neural net. Eventually, when this world model gets good enough, the agent ‘just knows’ what the right move to make is from whatever state it’s in.

Visualizing Generation of Training Data During Self-Play, and Training the Agent’s Neural Net

In the video below I go over both how data is extracted from the self-play step, and how my neural net implementation works.
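For readers who prefer code to video, here is a rough sketch of the extraction idea: after a game finishes, every state visited during self-play becomes a training example whose policy target is the search’s visit-count distribution and whose value target is the final result, seen from the perspective of the player who was to move. The names GameStep and extract_training_examples are hypothetical, and the state encoding is simplified.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GameStep:
    state: list          # encoded state as seen by the player to move
    search_policy: list  # normalized visit counts from the tree search at this state
    player: int          # +1 or -1, whichever player was to move

def extract_training_examples(game: List[GameStep], final_result: int) -> List[Tuple[list, list, float]]:
    """Turn one finished self-play game into (state, policy target, value target) tuples.

    final_result is the outcome from player +1's perspective: +1 win, -1 loss, 0 draw.
    Each state's value target is that outcome flipped into the perspective of the
    player who was to move, so the net learns 'how good is this state for me'.
    """
    examples = []
    for step in game:
        value_target = final_result * step.player
        examples.append((step.state, step.search_policy, float(value_target)))
    return examples
```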

Let me know if any of this is unclear, or if I missed explaining something important!
