Setting up Elo Scores For a New Game
The Elo rating system was developed by Arpad Elo, a Hungarian-American physicist, in the 1960s to improve the existing chess rating system used by the United States Chess Federation (USCF).
Apparently Elo's own rating peaked around 2100-2200, making him very good but not grandmaster (2500+) level.
It was later adopted by FIDE (the International Chess Federation) in 1970, becoming the global standard for chess rankings.
Elo has since been adapted for various competitive domains, including other board games (Go, backgammon), e-sports (matchmaking systems in games like League of Legends), sports analytics (applied in football (NFL), basketball (NBA), and tennis for predictive modeling), and Kaggle competitions.
Elo himself likened measuring a player's rating to measuring the position of a cork bobbing up and down on the surface of agitated water with a yardstick tied to a swaying rope.
Fortunately, implementing it does not require any agitated water, corks, yard sticks, and/or ropes, although if you’d like to do so please send pics!
Here’s a rough guide to the playing strength that different Elo tournament ratings correspond to for chess players. I got it from this video, which is worth watching if you’re interested in more information about Elo ratings in chess.
How to Set up an Elo Scoring System for a New Game
Elo is a relative scoring system (players are ranked relative to how they play against each other — there is no absolute score), so it is fine if you don’t have established absolute baselines for quality of play. In the sort of games I’m interested in, like Tangled, we don’t have any baselines so that’s good!
Here’s a way to set up a ranking system for agents in new games. First, we have to choose an arbitrary agent to which we will assign an Elo rating of 1,000. This is fine, as no matter how good or bad the agent is, we can evaluate all our other baseline agents against it and the others will either be higher or lower based on how they do.
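The reason the anchor choice is harmless falls out of the Elo model itself: the expected score between two players depends only on their rating difference, so shifting every rating by a constant changes nothing. A minimal sketch (the function name is mine):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# The anchor value is arbitrary: only rating *differences* matter,
# so shifting both ratings by a constant changes nothing.
print(expected_score(1200, 1000))   # ~0.76
print(expected_score(1700, 1500))   # same value
```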
What I’m proposing for Tangled is that we generate five baseline agents, the first of which is random, and the other four are pure MCTS agents. We can always do this for any game graph. The four MCTS agents will have 10, 100, 1,000, and 2,000 rollouts; I’ll denote MCTS with k rollouts MCTSₖ. We will assign MCTS₁₀ an Elo of 1,000 and measure the relative performance of the other four baseline agents against it.
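As a sketch, the proposed baseline roster might be set up like this (the agent labels and the ratings dict are my own conventions; the actual random and MCTS agent implementations live elsewhere):

```python
# Hypothetical baseline roster for Tangled: one random agent plus
# pure MCTS agents at increasing rollout counts.
ROLLOUTS = [10, 100, 1000, 2000]
baselines = ["random"] + [f"MCTS_{k}" for k in ROLLOUTS]

# MCTS_10 is the anchor: it is assigned Elo 1,000 by definition, and
# every other baseline is rated by playing matches against it.
ratings = {"MCTS_10": 1000}
print(baselines)   # ['random', 'MCTS_10', 'MCTS_100', 'MCTS_1000', 'MCTS_2000']
```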
Different game graphs will have different Elo scores for these agents. For example, the Ridiculously Small Graph only has four moves per game, so we’d expect even a random agent to do fairly well, and the high rollout MCTS agents should play optimally. However, on a large game graph, it’s unclear how increasing the number of rollouts will affect the relative strengths of the agents.
Once we have decided on our baseline agent group, we then play a large number of games (say N = 1,000) where our 1,000 Elo agent plays against each of the others one at a time. We record the number of times the challenging agent wins (call this P) and draws (call this D). Writing the challenger’s score fraction as S = (P + D/2) / N, the difference in Elo scores is given by

ΔElo = 400 · log₁₀( S / (1 − S) )
To estimate the Elo score for each of the other agents, we add this difference to the baseline Elo of 1,000 (it comes out positive if the challenger wins more than it loses, and negative otherwise).
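Here is a sketch of that estimate in code. With S = (P + D/2)/N, the Elo difference is 400·log₁₀(S/(1−S)); the function name is mine, and it deliberately rejects S = 0 and S = 1, where the formula diverges:

```python
import math

def elo_difference(wins: int, draws: int, n_games: int) -> float:
    """Estimate the challenger's Elo relative to the anchor from
    wins (P) and draws (D) over n_games (N)."""
    s = (wins + 0.5 * draws) / n_games   # challenger's score fraction
    if s <= 0.0 or s >= 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    return 400.0 * math.log10(s / (1.0 - s))

# Challenger wins 600 and draws 100 of 1,000 games: S = 0.65
print(1000 + elo_difference(600, 100, 1000))   # ~1108
```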
Results for Different Game Graphs
Here are the results for a set of increasingly large game graphs. In my numbering system, the game graphs are as shown in the figure below, with the smallest being graph 11 (the Ridiculously Small Graph), and the largest considered here being graph 14, the first Loupekine snark graph with 22 vertices and 33 edges — still a very small graph.
Analysis & Thoughts
For these small graphs, it looks like there is no advantage to increasing from 1,000 to 2,000 rollouts in MCTS. I suspect that MCTS₁₀₀₀ is playing at or close to optimal for these small graphs.
There is a high percentage of draws across all of these small game sizes, which compresses the spread in Elo; that is not what I want. I also wonder whether a good strategy might be to play randomly up until the endgame, then spend a lot of compute figuring out how to win or draw by placing your token last, since you might only have to solve the underlying problem once.
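One way to see the draw effect numerically: hold the win rate among decisive games fixed at 60% and vary the draw fraction (hypothetical numbers):

```python
import math

def elo_diff(score_fraction: float) -> float:
    # 400 * log10(S / (1 - S)), the Elo-difference formula
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# 1,000 games; the challenger wins 60% of decisive games in both cases.
no_draws   = elo_diff((600 + 0.5 * 0)   / 1000)   # S = 0.60 -> ~70 Elo
many_draws = elo_diff((60  + 0.5 * 900) / 1000)   # S = 0.51 -> ~7 Elo
print(round(no_draws, 1), round(many_draws, 1))
```

The same relative skill shows up as a ten-times-smaller rating gap when 90% of games are drawn, which is exactly the compression seen in the results.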
It might be nice to do something to the game to make it obvious that gameplay at all stages contributes to win conditions. One way to potentially do this would be to take away the freedom to place your vertex and instead make the entire game about edge coloring. You could do this by choosing a mirror symmetric game graph and having each player ‘own’ one of the sides (like in Chess).
As a classical player strategy, you might attempt to planarize the game graph by knocking out the edges that make it non-planar, thereby potentially making the underlying computation easier. This argues for using graphs that are robust against planarization via edge deletion. Apparently there are 3-regular graph types that are robust against planarization (and of course we don’t have to keep the 3-regular property, although it does make the game easier for human players!). From ChatGPT:
The highest planar deletion numbers tend to be found in:
Random cubic expanders (dense non-planar substructures everywhere).
Generalized Petersen Graphs.
Highly connected snarks (Goldberg, Isaacs).
High-genus cage graphs (Tutte’s cages, Foster graphs, e.g. Foster Cage F_12,3 and Tutte’s 8-Cage).
These graphs resist planarization due to their high connectivity, expansion properties, and complex non-planar minors. Many require removing a number of edges on the order of the total edge count to become planar.
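One cheap way to screen candidate graphs for this robustness is a girth-based lower bound on the planar deletion number (skewness): a planar simple graph with girth g has at most g/(g−2)·(V−2) edges, so everything beyond that must be deleted. This is only a lower bound, and the function name is mine:

```python
import math

def skewness_lower_bound(v: int, e: int, girth: int) -> int:
    """Girth-based lower bound on the planar deletion number (skewness).

    A planar simple graph with girth g has at most g/(g-2)*(V-2) edges;
    deleting edges cannot shorten cycles, so the bound stays valid for
    whatever planar subgraph remains.
    """
    max_planar_edges = math.floor(girth / (girth - 2) * (v - 2))
    return max(0, e - max_planar_edges)

# Petersen graph: 10 vertices, 15 edges, girth 5 -> bound of 2,
# which happens to equal its true skewness.
print(skewness_lower_bound(10, 15, 5))   # 2
```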