AlphaZero is a generic algorithm that can learn to play any perfect information board game.
Board games are interesting for robots because they help us understand whether they can reason in complex environments where the rules are known. If robots ever take over, they will probably fund their campaign by playing board games online for money.
There are tons of new board games published every year, which is great because different board games test vastly different skills:
- Chess is about reasoning through a deep tree of actions
- Go is about depth and spatial understanding
- Poker is about probabilities and seeing others intent
- Diplomacy is about negotiation and backstabbing
Chess and Go are examples of full information games, where the entire history of the match and the current state of the board are fully available to both players at all times.
Poker and Diplomacy are examples of hidden information games, where the actual state of a player (their cards and intents) can’t be directly observed but can be inferred.
Games that look like Chess or Go, such as Reversi or Mancala, can all be tackled with the AlphaZero algorithm.
Games that look like Poker or Diplomacy, such as Bridge or Hanabi, are sometimes tackled by newer research involving counterfactual regret minimization, but not without a lot of caveats.
This post will show you how to solve board games that look like Chess or Go using AlphaZero.
AlphaZero doesn’t use human games as data; it’s trained entirely by playing against itself. AlphaZero’s precursor AlphaGo used both human data and all sorts of custom features, which were subsequently removed in AlphaZero. This is a big deal because it means that chess amateurs can build cutting-edge chess engines, while in the past you needed to pair up programmers with chess grandmasters.
TL;DR: The general idea of AlphaZero is to construct a tree which would keep track of which moves are good or bad in certain positions by randomly simulating games starting from those positions. Because the trees for Chess and Go are so massive, we use a neural network to guide the search process.
When you annotate a chess game the typical representation is a list, shown to the right in the below screenshot.
Now it’s black’s turn and black has many options. Some reasonable ones (not the best ones) are dxe4, e6 etc. — we can draw out the new state the board will be in as a tree.
We can keep going in this manner and draw out a full tree for all possible chess positions and annotate each action by a score where +0.99 means it’s an amazing move and -0.99 means it’s an awful move.
Theoretically we could store this mapping between positions, actions and scores in a dictionary. The algorithm for playing superhuman chess becomes as simple as
- We look up all the entries for our given position
- Sort all actions by score
- Pick the highest ranking action
- Repeat until win
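The steps above can be sketched in a few lines of Python. Everything here is a hypothetical placeholder: the `score_table` dictionary, the position encoding, and the move strings are stand-ins for whatever representation a real engine would use.

```python
# Sketch of the lookup-table player described above.
# `score_table` maps (position, move) pairs to scores in [-0.99, +0.99].

def pick_move(position, legal_moves, score_table):
    """Look up every legal move for this position and return the
    highest-scoring one. Unknown moves default to a neutral 0.0."""
    scored = [(score_table.get((position, move), 0.0), move)
              for move in legal_moves]
    scored.sort(reverse=True)   # sort all actions by score
    return scored[0][1]         # pick the highest-ranking action

# Toy example: two candidate replies for Black in some position "p1".
table = {("p1", "dxe4"): 0.31, ("p1", "e6"): 0.12}
best = pick_move("p1", ["dxe4", "e6"], table)
```

The whole difficulty of the problem is hidden inside `score_table`, which is exactly the question the next section answers.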
But how do we generate this dictionary in the first place?
Monte Carlo Tree Search
Let’s say we’re in the below board position. Who is winning?
The main idea of Monte Carlo Tree Search is that we simulate tons of games starting from this position where we make each player play randomly. We then look at those games and study the outcomes: if white wins 60% of the games, then that means the above position has a score of +0.6 for white.
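To make the random-playout idea concrete, here is a minimal sketch on a toy game rather than Chess: Nim, where players alternately take 1–3 stones and whoever takes the last stone wins. The same evaluation loop applies to any game that can tell you its legal moves and when it has ended.

```python
import random

def rollout(stones, player):
    """Play random moves until the game ends; return the winner (0 or 1)."""
    while stones > 0:
        take = random.choice([m for m in (1, 2, 3) if m <= stones])
        stones -= take
        if stones == 0:
            return player   # this player took the last stone and wins
        player = 1 - player

def evaluate(stones, player, n_games=5000):
    """Score in [0, 1]: fraction of random games won by `player`."""
    wins = sum(rollout(stones, player) == player for _ in range(n_games))
    return wins / n_games

random.seed(0)
# With 1 stone left, the player to move always wins: score is exactly 1.0.
score = evaluate(1, 0)
```

A higher score means the position is better for the player to move, which is exactly the statistic MCTS accumulates at every node.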
But how do we decide which actions to explore first?
The key idea is to treat our computational budget as fixed: with infinite time we could run this algorithm forever and converge to the correct result, but physically we can’t. So instead we’ll use a strategy called Upper Confidence bounds applied to Trees, or UCT.
During tree search some branches will be reached more often, which means we’ll have a better estimate of how good those states are. Other states may never be visited, so we can never even produce an estimate of how good they are, even if they contain great moves. This brings us back to the classic exploration vs. exploitation trade-off in Reinforcement Learning.
So instead of picking the node with the highest win percentage w, we pick the node that maximizes the UCT formula w + c·√(ln N / n), where N is the total number of simulations run from the parent node:
- w is the current estimate of winning percentage — exploitation
- n is the number of simulations started at the considered node
- The square root expression is large when a node hasn’t been explored much — exploration
- c is a constant to balance between exploration and exploitation — during training we want c to be close to 1 and during testing we want c to be close to 0
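The selection rule fits in a few lines. This is a minimal sketch, assuming each child node stores its estimated win rate `w` and visit count `n`; the names and data layout are illustrative, not taken from any particular library.

```python
import math

def uct(w, n, N, c=1.0):
    """Upper confidence bound for trees: exploitation + exploration."""
    if n == 0:
        return float("inf")   # always try unvisited nodes first
    return w + c * math.sqrt(math.log(N) / n)

def select(children, c=1.0):
    """Pick the child maximizing UCT. `children` maps move -> (w, n)."""
    N = sum(n for _, n in children.values()) or 1
    return max(children, key=lambda m: uct(*children[m], N, c))

# With c=1, a rarely-visited move can beat a slightly better-scoring one;
# with c=0 we greedily exploit the current best estimate.
children = {"e4": (0.55, 100), "d4": (0.50, 2)}
```

Here `select(children, c=1.0)` favors the under-explored d4, while `select(children, c=0.0)` falls back to the highest win rate, e4, which matches the training-time vs. test-time choice of c described above.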
Once the node is chosen we simulate tons of games, record the statistics and update the value of the node accordingly. In the case of Chess, the best move at any point is the best move with respect to the best move your opponent could play.
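The update step can be sketched as follows. The sign of the result flips at every ply because a good position for you is a bad position for your opponent, which is the minimax idea from the previous paragraph. The node layout here is a hypothetical simplification.

```python
def backup(path, z):
    """Walk the tree path from leaf to root after a simulation ends
    with result z (+1 win, -1 loss, from the leaf player's view).
    `path` is a list of node dicts with keys 'n' (visit count) and
    'w' (running mean value in [-1, 1]), ordered leaf -> root."""
    for node in path:
        node["n"] += 1
        # incremental mean update: w += (z - w) / n
        node["w"] += (z - node["w"]) / node["n"]
        z = -z   # flip perspective for the parent (opponent to move)

leaf, parent = {"n": 0, "w": 0.0}, {"n": 0, "w": 0.0}
backup([leaf, parent], z=1.0)
# A win at the leaf counts as a loss one ply up, for the other player.
```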
We’re left with one final issue: even if we run MCTS for a long time, there’s no way we’re covering anything close to the roughly 10¹²⁰ possible games of Chess, so we need a way to let our agent play a strong move even in a state it has never seen before.
Actor Critic Network
AlphaZero uses a Reinforcement Learning algorithm to guide the tree search process.
The specific type of Reinforcement Learning algorithm used is an Actor Critic network which is a single network with 2 outputs
- Actor: A policy function π(s) which outputs a probability distribution over actions
- Critic: A value function V(s) ∈ [-1,1] which outputs how good the current position is
Some people also add a third optional output, the advantage A(s,a) = Q(s,a) - V(s).
While training a neural network is time consuming, evaluating it is generally fast especially with a GPU so we can use a neural network to come up with fast estimates of what the next best action is. The model can be trained by the self play data generated by the MCTS process itself.
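A two-headed network is less mysterious than it sounds. Below is a minimal sketch of the forward pass in plain NumPy: one shared trunk, a policy head squashed through a softmax, and a value head squashed through tanh into [-1, 1]. All sizes and weights are illustrative toys, not the real AlphaZero architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE, HIDDEN, ACTIONS = 8, 16, 4   # toy sizes, not AlphaZero's

# One set of shared weights, then one weight matrix per head.
W_trunk = rng.normal(0, 0.1, (STATE, HIDDEN))
W_policy = rng.normal(0, 0.1, (HIDDEN, ACTIONS))
W_value = rng.normal(0, 0.1, (HIDDEN, 1))

def forward(s):
    h = np.tanh(s @ W_trunk)                         # shared layers
    logits = h @ W_policy
    policy = np.exp(logits) / np.exp(logits).sum()   # actor: pi(s), sums to 1
    value = np.tanh(h @ W_value)[0]                  # critic: V(s) in [-1, 1]
    return policy, value

pi, v = forward(rng.normal(size=STATE))
```

During MCTS, `pi` replaces the uniform random playout policy and `v` replaces (or shortcuts) the expensive rollout, which is exactly how the network guides the search.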
MCTS is like your memory and an Actor Critic network is like your intuition
And that’s all there is to AlphaZero. Let’s look at how a real library implements it.
Code Deep Dive: chess-alpha-zero
Unfortunately, I couldn’t find a Julia library which implements AlphaZero for Chess.
So instead we’ll be looking at chess-alpha-zero to see how everything works.
At a glance
- The binder folder holds information regarding GPU configuration which we can ignore
- All the files in the main directory can also be safely ignored
- data/model holds the trained model as an h5 and json file which you would then load to play against the model
- notebooks/ has an example of how to use chess-alpha-zero
Which leaves us with src/chess_zero, where the serious stuff is going on. Most of the files in this repo are about how to set up a Chess bot that you can play against and how to manage distributed training.
But for our purposes we only need to understand 2 files
The first file takes care of instantiating and maintaining the Actor Critic network.
The model is built using Keras and is a great example of how to create a neural network with multiple outputs. There are 3 main parts:
1. Common layers shared between the policy and value networks
2. Policy network, which outputs an action
3. Value network, which outputs the value of the chosen action
The second file takes care of the Monte Carlo Tree Search algorithm. It has 3 main parts:
1. Stats for each node in the tree
2. Tree search select step
3. Tree search backup step
And that’s it! It’s probably still worth going over the rest of the codebase to see how the full infrastructure works but you now understand the key parts.
- Game Changer — if you really enjoy Chess, this is a book that analyzes the chess games AlphaZero played. I’ve analyzed chess games for most of my life, and it definitely felt like AlphaZero was playing a new kind of chess: a sort of Tal on steroids.
- Get AlphaZero working on another not so popular board game that you love. If you also end up programming the board game engine from scratch you’re bound to find a couple of researchers interested in bench-marking their algorithms on your game.
- AlphaGo.jl — a Julia implementation with very clean code. This is an AlphaGo not AlphaZero implementation which means you’ll see a bit more code regarding how to load human game data and board featurization.
- Leela-zero — an open source project attempting to replicate DeepMind’s results. The compute required to make AlphaZero work was data-center big, and Leela-zero will show you how to scale everything I talked about to a distributed setting. It’s also notoriously difficult to find good hyper-parameters in Reinforcement Learning, so any compute donations would be highly appreciated.
- Deep Learning and the Game of Go — this is my favorite introduction to Machine Learning textbook. It goes over all the main components you need to deploy a fully working Reinforcement Learning system, including a Go engine, a Go web service you can play against, and a data loader to bootstrap training.
- DeepStack — is a project that solves simplified variants of Poker. The main idea is to use No Regret learning which is a way of updating your past beliefs based on the new cards that your opponents played. DeepStack also more explicitly uses ideas from Game Theory while AlphaZero does not. I’ve been sitting on a draft blog post on this topic for a couple of months so if you’d really like to read it, please let me know!