Deconstructing AlphaZero Part 5: The Joys and Horrors of Parallelization

I Wrote This Originally on January 13, 2025

I thought it might be fun to give you some insight into how I decide what to work on. When building things, you always face choices about what to do in what order. Here’s an example.

I have a working AlphaZero implementation. It does everything it’s supposed to do and works well. Unfortunately, it does not use my computer’s resources efficiently. As the algorithm grinds on endlessly, most of my system's processing power just sits there, not being used. Not great!

I see a way to change how the code works that could dramatically speed everything up. The improvement is to run the games within both the self-play and competitive phases in parallel instead of in series. Currently, during both of these phases, the system runs one process on the CPU and one on the GPU (handling inference on leaf states), alternating between the two until one game is done; then it moves on to the next game, and so on, until all the games are played. But this doesn’t make sense. Instead, we should run all the self-play and competitive-phase games in parallel. Why? Because they are all independent of each other: no game depends on any of the others, except through the use of a common neural net for inference. This is the situation known as being embarrassingly parallel.

But there is another reason to do this. The majority of the time during serial game play is spent querying the neural net, and each query sends in a single state and gets back a single policy and value. GPUs are far more efficient when they process many inputs at once, so what we should be doing is sending in a batch of states and getting back a batch of answers at the same time.

If we were to run multiple self-play games in parallel, every time any of them gets to a leaf state, we could write that state to a queue, have the neural net wait until the queue is a certain length, and then process all of these at the same time. It would then output all the results back to a separate queue that the self-play games would consume to get the result they need to continue self-play.
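Roughly, the shape of what I have in mind looks like the toy sketch below (this isn’t my actual code; evaluate_batch stands in for the neural net, and all the names, batch sizes, and timeouts are placeholders):

import multiprocessing as mp
import queue  # only for the queue.Empty exception

BATCH_SIZE = 8
BATCH_TIMEOUT = 0.01  # seconds to wait before flushing a partial batch

def evaluate_batch(states):
    # stand-in for the neural net: return a (policy, value) pair per state
    return [({"move_a": 0.5, "move_b": 0.5}, 0.0) for _ in states]

def inference_server(request_queue, response_queues):
    # collect leaf states from all the self-play workers and evaluate them in batches
    while True:
        batch = []
        try:
            batch.append(request_queue.get(timeout=5))  # block for the first request
            while len(batch) < BATCH_SIZE:              # then fill the batch quickly
                batch.append(request_queue.get(timeout=BATCH_TIMEOUT))
        except queue.Empty:
            if not batch:
                return  # no work for a while; shut down
        worker_ids, states = zip(*batch)
        for worker_id, result in zip(worker_ids, evaluate_batch(states)):
            response_queues[worker_id].put(result)

def self_play_worker(worker_id, request_queue, response_queue, num_leaves):
    for i in range(num_leaves):
        leaf_state = f"state-{worker_id}-{i}"        # stand-in for a real game state
        request_queue.put((worker_id, leaf_state))   # ask the server for policy/value
        policy, value = response_queue.get()         # block until the answer comes back
    # ...in the real thing, policy/value would be used to continue the MCTS rollout...

if __name__ == "__main__":
    num_workers = 4
    request_queue = mp.Queue()
    response_queues = [mp.Queue() for _ in range(num_workers)]
    server = mp.Process(target=inference_server, args=(request_queue, response_queues))
    server.start()
    workers = [mp.Process(target=self_play_worker,
                          args=(i, request_queue, response_queues[i], 10))
               for i in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    server.join()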

This whole thing would be trivial to implement except for one complicating factor: all the self-play games share a single neural net process running on one GPU. So we need to figure out how to do that efficiently.

There is also a separate question of whether to further complicate matters by using multiple threads in addition to multiple processes. The difference between the two is that processes are self-contained computing environments that generally work best when each one consumes the compute of a single core (so if you have 20 cores, you’d want 20 processes, one per core). Threads, on the other hand, run within a process and share everything inside it. In Python that makes them subject to the Global Interpreter Lock (GIL), which only lets one thread execute Python bytecode at a time; but if you can structure your code so the real work happens outside the GIL (in I/O or in native libraries), threads can dramatically increase the number of things you can do in parallel.

I think in this case, it’s prudent to not use threads, and instead only use processes. This does cap the parallelism upside somewhat, but 20x faster for 20 cores is pretty damned good. And if the thing really does work, we could invest in a system upgrade to get a processor with a lot more cores (like the high core count AMD Threadrippers).

Threadripper Pro 7995WX, 96 cores, 5.3GHz clock, 384MB of L3 cache, and 128 PCIe Gen 5 lanes. Want.

The Typical Science / Technology Quandary

So now the question I’m facing is whether to spend the time to implement the above changes. While it does offer the possibility of a dramatic speed-up, it will require considerable time and effort to design, test, and debug the whole thing — maybe ~80 hours of work. And there is the real possibility that I could run up against a fundamental issue that prevents the thing from working at all and therefore all the time spent would be wasted.

Another issue is that if you want to do AlphaZero ‘right’, you want it to be able to scale to large cloud computing environments. Doing the sort of local optimization above might help in understanding how to do the distributed computing required, but if you then move to the cloud and start doing it for real, you’d probably need to re-engineer everything from scratch anyway, which would make most of the local work moot. Wouldn’t just doing that first be a better use of time?

Another consideration. In a previous life we built and used a BOINC project at D-Wave. Could an AlphaZero agent be built that runs on BOINC? If so, you’d want to do the local optimization I’m thinking about here. But do I want to have to try to re-learn how to use BOINC? That’s another question to ponder.

So the question here: do I try to implement self-play and competitive-phase parallelism or not? I think I’m going to, because the potential speed-ups are worth the risk, and I’ll learn a lot about making this sort of parallel system work!

Update, Three Weeks Later… February 3, 2025

Well friends, I started putting in the work on the parallelization problem, and unfortunately I have so far been unable to make it work the way I think it should. Not for lack of trying though. Along the way I discovered some things and did some side quests that I’ll briefly talk about here.

  1. The first thing I tried was Python’s multiprocessing with queues, where the CPU processes generated leaf states and put them on a queue, and the GPU processes consumed those states and wrote the answers to another queue, which the CPU processes then consumed. I got this to work, but it did not speed the entire thing up. Some of the key things I changed were making sure that objects shared between processes did not carry large data members (for example, train_examples_history doesn’t need to be a member variable of Coach) and removing the neural net from the AZMCTS class altogether (it doesn’t need it; its function is replaced by putting to and getting from a queue). When I profiled it, by far the most time was spent in opaque Windows things related to processes waiting for other processes (_winapi.WaitForSingleObject). Apparently, at least according to Deepseek, “The _winapi.WaitForSingleObject dominating the runtime is a common issue when using Python's multiprocessing on Windows. This is because Windows uses a different process spawning mechanism compared to Unix-based systems (like Linux or macOS), which introduces significant overhead when creating and managing processes.” OK then.

  2. I started interrogating Deepseek about alternative methods, and one that swapped multiprocessing for concurrent.futures.ProcessPoolExecutor looked promising (I did some preliminary tests on a simple neural net and it seemed to remove all the Windows-y cruft). However, the example used some features of pure Pytorch neural nets for sharing weights across processes, whereas I had been using Keras with Pytorch as a backend. I had not used pure Pytorch before, but I decided to remove Keras and learn how to use it. This took some fiddling, but eventually I had a pure Pytorch implementation (a rough skeleton of what such a net looks like is sketched after this list), and interestingly it was much faster than the Keras version, so much so that the calls to the neural net for inference stopped being the bottleneck. I ran a test with 50 MCTS simulations and found a 4.1x speedup in self-play, 4.4x speedup in training, and 3.9x speedup in competitive play just by replacing Keras with pure Pytorch. I am not sure why; I suspect it has something to do with how the models and weights are loaded onto the GPU. A pleasant unexpected bonus, and I’m going to keep the pure Pytorch approach going forward.

  3. The first step to replacing Keras was on the original serial code, and since I had to learn how to implement neural nets using Pytorch, I figured I’d take a detour to look at implementing graph neural nets for this case. Graph neural nets are designed to take graphs as input, so that seemed like a natural thing to do for Tangled. I implemented one, but I found that its performance was significantly worse than the basic convnet style thing I had working where the graph inputs were just flattened into a one-dimensional input. It could just be that my implementation was stupid and/or had bugs, but since there didn’t seem to be any obvious wins and it complicated matters quite a bit, I dropped the graph neural net stuff for the time being. Maybe something to revisit later.

  4. I then started trying the concurrent.futures.ProcessPoolExecutor plus model.share_weights() approach. It worked, but again it did not speed things up; in fact, it slowed the entire computation down by quite a bit. Arg! Why???

  5. I’m going to keep working on this for one more week and if I can’t solve it I’ll give up.
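As an aside, for anyone curious what a pure Pytorch policy/value net skeleton looks like (referenced in point 2 above), here’s a rough sketch; the layer sizes and names are placeholders, not my actual architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    # placeholder sizes; the real shapes depend on the game
    def __init__(self, input_size, action_size, hidden_size=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_size, action_size)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        h = self.body(x)
        policy = F.log_softmax(self.policy_head(h), dim=1)  # log-probs over moves
        value = torch.tanh(self.value_head(h))              # scalar in [-1, 1]
        return policy, value

# typical single-state inference call:
# model = PolicyValueNet(input_size=64, action_size=32)
# model.eval()
# with torch.no_grad():
#     log_policy, value = model(torch.zeros(1, 64))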

Update, February 6, 2025: IT’S ALIVE!!!

OK friends, I think I’ve got the whole thing working now!

The main issues with parallel code tend to be architectural or structural. You need to make sure that the way you parallelize something, in terms of what data structures are instantiated where, follows a set of strict patterns. So what I did is essentially rewrite the entire code base from scratch in ‘thinking parallel’ mode. One of the mistakes I made here is that I was trying to maintain the architecture of the serial code and ‘just parallelize it’. Advice: don’t do that.

Instead I used a combination of Claude, GPT, and DeepSeek to suggest and explain good patterns for how to set up the architecture of the thing, and rewrote my code following the suggested patterns. And it worked!

Here are some of the changes I made, with some of the reasons for them.

A Good Pattern For Executing Embarrassingly Parallel Code

Windows, and even the IDE I use (PyCharm), have some peculiarities when you are running multiprocessing. If you also want to use cProfile (I generally do), the issues compound. Here is a pattern I highly suggest you use in the script that runs your code:

import cProfile
import pstats
import multiprocessing as mp

# main() below is whatever your own top-level entry point is
if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # Ensures correct behavior in PyCharm

    with cProfile.Profile() as pr:
        main()

    stats = pstats.Stats(pr)
    stats.strip_dirs().sort_stats('tottime').print_stats(20)   # show top 20 results

Encapsulating under if __name__ == "__main__": ensures that you don’t accidentally spawn processes recursively. The mp.set_start_method("spawn", force=True) line controls how processes are created (different operating systems, and even PyCharm itself, don’t always default to the same method, so you should state it explicitly). And make sure that main() runs inside the cProfile context as shown. All of this might seem obvious, but if you do these steps in a different order or omit one of them, it can get very difficult to identify what’s gone wrong.

Pay Close Attention To Variable Scope and Don’t Use Global Variables

In my serial code, I declared some global variables above main(). This worked fine in that case, but it is not good practice for writing parallel code. Move all your variables into encapsulated scopes. You want to minimize passing large or complex objects between different scopes.
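A toy illustration of what I mean (the names here are placeholders, not my real code):

# Before: a module-level global that every spawned process has to re-create
# (or may not even see, depending on how the process was started).
NUM_SIMULATIONS = 50

def worker_using_global():
    return do_work(NUM_SIMULATIONS)

# After: everything the worker needs travels in through its arguments, so the
# process is self-contained and nothing is shared implicitly between scopes.
def worker_using_args(num_simulations):
    return do_work(num_simulations)

def do_work(n):
    return sum(range(n))  # stand-in for real work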

A Good Pattern For Parallel Code

Inside main(), enter into your parallelization using a one-line function call. Don’t be fancy inside main(). For example, do this:

from types import SimpleNamespace

def main():
    # args holds hyperparameters, accessed by attribute below (args.num_workers etc.);
    # the values here are just example values
    args = SimpleNamespace(num_workers=8, games_per_worker=10)
    self_play_data = parallel_self_play(args)

Next, write that function. Inside it you will spawn a set of new processes. In my case it looks like this:

from concurrent.futures import ProcessPoolExecutor, as_completed

def parallel_self_play(args):
    futures = []

    with ProcessPoolExecutor(max_workers=args.num_workers) as executor:
        # launch one worker per process; each worker plays several games
        for _ in range(args.num_workers):  # e.g. 4
            future = executor.submit(self_play_worker, args)
            futures.append(future)

        # gather each worker's results as it finishes
        self_play_data = []
        for future in as_completed(futures):
            self_play_data.extend(future.result())

    return self_play_data

Now write the self_play_worker function. In my case, I need the AlphaZero neural net for the processes I’m going to parallelize, and I get it by loading it from disk inside this function. Note that I am not passing it in as an argument. Each self_play_worker call runs in a separate process, so each process loads its own, presumably identical, copy of the neural net. This is an example of how thinking about encapsulation is important for parallel code. I would also advise deleting the model and emptying the CUDA cache after you are done with it, to explicitly free memory.

def self_play_worker(args):
    # load_model_and_optimizer() is my own helper that reads the checkpoint from disk;
    # each spawned process loads its own copy of the model
    model = load_model_and_optimizer()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # set the model to evaluation mode (inference only)

    # each worker plays several games and collects their training examples
    worker_data = [execute_episode(model, args) for _ in range(args.games_per_worker)]

    # explicitly free GPU memory before the process exits
    del model
    torch.cuda.empty_cache()

    return worker_data

And finally the payload function that does the work inside each process running self_play_worker. I’m assuming here that there are many more jobs to run than processes, so each process has to run args.games_per_worker jobs.

def execute_episode(model, args):
    # create a new MCTS search tree for each episode
    mcts = MCTS(model, args)

    # play one game, calling model for inference on MCTS leaf states as usual,
    # collecting (state, policy, value) tuples into train_examples as the game goes

    return train_examples  # list of (state, policy, value) training examples

This pattern (main ==> parallel (spawns processes) ==> worker (one worker per process) ==> payload (does the work and returns its results to the worker)) is what I’d recommend you follow. When using neural nets that run on GPUs, load them from disk within the worker after that process has been spawned, and make sure you clear them from memory before exiting. Make all of these encapsulated functions that don’t share large or complex data structures.

Speedup From Parallelization

I ran a test comparing serial pure Pytorch against my new parallel implementation, trying 2, 4, 8, and 16 workers. Of these, 8 was the fastest on my system (3090 GPU + a 10-core processor), so that’s the number I’m using.

I found that for a test with 50 MCTS simulations, the 8-worker parallel version was 4.2x faster during self-play and 5.8x faster during competitive play, so not a linear speedup (the GPU calls likely slow things down). The training step is identical in both, so no speedup there. But it’s a very significant win: together with the switch to Pytorch, the total speedup I got over the past month was around 15-20x in wallclock time, which is massive.

Using The GPU Effectively

In this strategy, I’m just making a bunch of copies of the same neural net, one in each process, and loading them all onto the GPU. This works, but I still think it’s kind of stupid. It would be great if all the processes could share a single copy of the neural net so it only had to be loaded once, but on the other hand that could just move the bottleneck back to the GPU. I think as long as the neural net is small enough to load multiple copies, doing it this way is fine. For some of my recent experiments, I was using an ~8 million parameter neural net, and I found that the optimal number of processes was around 8. If I estimate the memory footprint of an 8M-parameter model, including overhead, at roughly 500MB per process, then 8 copies (about 4GB) fit comfortably within the 24GB on my 3090 card.
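The back-of-envelope arithmetic, for the record (the 500MB per-process figure is my rough guess including CUDA context and other overhead, not a measurement):

# rough memory estimate for running one model copy per process
params = 8e6                  # ~8M parameter net
bytes_per_param = 4           # fp32 weights
weights_mb = params * bytes_per_param / 1e6       # = 32 MB for the weights alone
per_process_mb = 500          # rough guess including CUDA context and other overhead
num_processes = 8
total_gb = num_processes * per_process_mb / 1000  # = 4 GB
print(f"{weights_mb:.0f} MB weights, {total_gb:.1f} GB total across processes")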

Summary

Well it took me about a month of banging away at this problem and trying a bunch of things that seemed to make sense before I finally found a way to do it. The investment was worth it, not only because I now have a much faster and cleaner AlphaZero implementation, but also I learned a lot about better ways to architect parallel code.
