1 Introduction
Have you ever lost a strategy game against a skilled opponent? It probably seemed like they were ahead of you at every turn. They might have been planning ahead and anticipating your actions, then formulating their strategy to counter yours. If this opponent was a computer, they might have been using one of the strategies that we are about to explore.
%load_ext autoreload
%autoreload 2

from enum import IntEnum
from typing import NamedTuple

import gymnasium as gym
from gymnasium import spaces

from utils import Int, Array, latex, jnp
2 Deterministic, zero-sum, fully observable two-player games
In this chapter, we will focus on games that are:
- deterministic,
- zero-sum (one player wins and the other loses),
- fully observable, that is, the state of the game is perfectly known by both players,
- played by two players who alternate turns.
We can represent such a game as a complete game tree. Each possible state is a node in the tree, and since we only consider deterministic games, we can represent actions as edges leading from the current state to the next. Each path through the tree, from root to leaf, represents a single game.
If you could store the complete game tree on a computer, you would be able to win every potentially winnable game by searching all paths from your current state and taking a winning move. We will see an explicit algorithm for this in the next section. However, as games become more complex, it becomes computationally impossible to search every possible path.
For instance, a chess player has roughly 30 actions to choose from at each turn, and each game takes roughly 40 moves per player, so trying to solve chess exactly using minimax would take somewhere on the order of $30^{80} \approx 10^{118}$ operations. That's 10 billion billion billion billion billion billion billion billion billion billion billion billion billion operations. As of the time of writing, the fastest processor can achieve almost 10 GHz (10 billion operations per second), so fully solving chess using minimax is many, many orders of magnitude out of reach.
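To spell out the arithmetic behind this estimate (using the branching factor of about 30 over roughly 80 plies from above, and a processor performing $10^{10}$ operations per second):

$$
30^{80} = 10^{80 \log_{10} 30} \approx 10^{118} \ \text{operations},
\qquad
\frac{10^{118}\ \text{operations}}{10^{10}\ \text{operations/second}} = 10^{108}\ \text{seconds} \approx 3 \times 10^{100}\ \text{years}.
$$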
It is thus intractable, in any realistic setting, to solve the complete game tree exactly. Luckily, only a small fraction of those games ever occur in reality; later in this chapter, we will explore ways to prune away parts of the tree that we know we can safely ignore. We can also approximate the value of a state without fully evaluating it. Using these approximations, we can no longer guarantee winning the game, but we can come up with strategies that will do well against most opponents.
2.1 Notation
Let us now describe these games formally. We’ll call the first player Max and the second player Min. Max seeks to maximize the final game score, while Min seeks to minimize the final game score.
- We'll use $\mathcal{S}$ to denote the set of all possible game states.
- The game begins in some initial state $s_0 \in \mathcal{S}$.
- Max moves on even turn numbers $h = 0, 2, 4, \ldots$, and Min moves on odd turn numbers $h = 1, 3, 5, \ldots$, where $h$ is a natural number.
- The space of possible actions, $\mathcal{A}_h(s)$, depends on the state itself, as well as whose turn it is. (For example, in tic-tac-toe, Max can only place Xs while Min can only place Os.)
- The game ends after $H$ total moves (which might be even or odd). We call the final state $s_H$ a terminal state.
- $P$ denotes the state transitions, that is, $P(s, a)$ denotes the resulting state when taking action $a \in \mathcal{A}_h(s)$ in state $s$. We'll assume that this function is time-homogeneous (a.k.a. stationary) and doesn't change across timesteps.
- $r(s_H)$ denotes the game score of the terminal state $s_H$. Note that this is some positive or negative value seen by both players: a positive value indicates Max winning, a negative value indicates Min winning, and a value of 0 indicates a tie.
We also call the sequence of states and actions $s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H$ a trajectory.
Our notation may remind you of Markov decision processes. Given that these games also involve a sequence of states and actions, can we formulate them as finite-horizon MDPs? The two settings are not exactly analogous, since in MDPs we only consider a single policy, while these games involve two distinct players with opposite objectives. Since we want to analyze the behavior of both players at the same time, describing such a game as an MDP is more trouble than it’s worth.
class Player(IntEnum):
    EMPTY = 0
    X = 1
    O = 2


class TicTacToeEnv(gym.Env):
    metadata = {"render.modes": ["human"]}

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(9)
        self.observation_space = spaces.Box(
            low=0, high=2, shape=(3, 3), dtype=jnp.int32
        )
        self.board = None
        self.current_player = None
        self.done = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.board = jnp.zeros((3, 3), dtype=jnp.int32)
        self.current_player = Player.X
        self.done = False
        return self.board, {}

    def step(self, action: jnp.int32) -> Int[Array, "3 3"]:
        """Take the given action in the current state and return the updated board."""
        if self.done:
            raise ValueError("The game is already over. Call `env.reset()` to reset the environment.")
        row, col = divmod(action, 3)
        if self.board[row, col] != Player.EMPTY:
            raise ValueError("Invalid move: that square is already occupied.")
        self.board = self.board.at[row, col].set(self.current_player)
        self.done = self.is_terminal(self.board)
        self.current_player = Player.O if self.current_player == Player.X else Player.X
        return self.board

    @staticmethod
    def is_terminal(s: Int[Array, "3 3"]):
        """Check if the game is over (a player has won or the board is full)."""
        return (
            TicTacToeEnv.is_winner(s, Player.X)
            or TicTacToeEnv.is_winner(s, Player.O)
            or jnp.all(s != Player.EMPTY)
        )

    @staticmethod
    def is_winner(board: Int[Array, "3 3"], player: Player):
        """Check if the given player has won."""
        return any(
            jnp.all(board[i, :] == player) or jnp.all(board[:, i] == player)
            for i in range(3)
        ) or jnp.all(jnp.diag(board) == player) or jnp.all(jnp.diag(jnp.fliplr(board)) == player)

    @staticmethod
    def show(s: Int[Array, "3 3"]):
        """Print the board."""
        for row in range(3):
            print(" | ".join(" XO"[int(s[row, col])] for col in range(3)))
            if row < 2:
                print("-" * 5)
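As a minimal usage sketch (assuming the environment behaves as defined above, with `step` returning the updated board):

env = TicTacToeEnv()
board, _ = env.reset()
for a in [4, 0, 8]:  # X takes the center, O takes the top-left corner, X takes the bottom-right
    board = env.step(a)
env.show(board)
print("X has won:", bool(TicTacToeEnv.is_winner(board, Player.X)))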
3 Min-max search
In the introduction, we claimed that we could win any potentially winnable game by looking ahead and predicting the opponent’s actions. This would mean that each nonterminal state already has some predetermined game score, that is, in each state, it is already “obvious” which player is going to win.
Let $V^\star_h(s)$ denote the game score under optimal play from both players starting in state $s$ at time $h$.
We can compute this by starting at the terminal states, when the game’s outcome is known, and working backwards, assuming that Max chooses the action that leads to the highest score and Min chooses the action that leads to the lowest score.
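In symbols, using the notation from the previous section, this backward induction can be written as

$$
V^\star_h(s) =
\begin{cases}
r(s) & h = H \text{ (terminal state),} \\
\max_{a \in \mathcal{A}_h(s)} V^\star_{h+1}(P(s, a)) & h \text{ even (Max's turn),} \\
\min_{a \in \mathcal{A}_h(s)} V^\star_{h+1}(P(s, a)) & h \text{ odd (Min's turn).}
\end{cases}
$$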
This translates directly into a recursive depth-first search algorithm for searching the complete game tree.
def minimax_search(s, player) -> tuple["Action", "Value"]:
    """Return the best action for the player to move in state s and its value (for Max)."""
    if env.is_terminal(s):
        return None, env.winner(s)

    if player is max:
        a_max, v_max = None, -float("inf")
        for a in env.action_space(s):
            _, v = minimax_search(env.step(s, a), min)
            if v > v_max:
                a_max, v_max = a, v
        return a_max, v_max
    else:
        a_min, v_min = None, float("inf")
        for a in env.action_space(s):
            _, v = minimax_search(env.step(s, a), max)
            if v < v_min:
                a_min, v_min = a, v
        return a_min, v_min
latex(minimax_search, custom_identifiers={"env.step": "P", "env.action_space": r"\mathcal{A}"})
3.1 Complexity of min-max search
At each of the $H$ timesteps, this algorithm iterates through the entire action space at that state, and therefore has a time complexity of $O(|\mathcal{A}|^{H})$ (where $|\mathcal{A}|$ denotes the largest number of actions possibly available at once). This makes the min-max algorithm impractical for even moderately sized games.
But do we need to compute the exact value of every possible state? Instead, is there some way we could “ignore” certain actions and their subtrees if we already know of better options? The alpha-beta search makes use of this intuition.
4 Alpha-beta search
The intuition behind alpha-beta search is as follows: Suppose Max is in state $s$, and considering whether to take action $a$ or $a'$. If at any point they find out that action $a'$ is definitely worse than (or equal to) action $a$, they don't need to evaluate action $a'$ any further.
Concretely, we run min-max search as above, except now we keep track of two additional parameters $\alpha$ and $\beta$ while evaluating each state:
- Starting in state $s$, Max can achieve a game score of at least $\alpha$ assuming Min plays optimally. That is, $V^\star_h(s) \geq \alpha$ at all points.
- Analogously, starting in state $s$, Min can ensure a game score of at most $\beta$ assuming Max plays optimally. That is, $V^\star_h(s) \leq \beta$ at all points.
Suppose we are evaluating $V^\star_h(s)$, where it is Max's turn ($h$ is even). We update $\alpha$ to be the highest minimax value achievable from $s$ so far. That is, the value of $s$ is at least $\alpha$. Suppose Max chooses action $a$, which leads to state $s'$, in which it is Min's turn. If any of Min's actions in $s'$ achieve a value $V^\star_{h+1}(s') \leq \alpha$, we know that Max would not choose action $a$, since they know that it is worse than whichever action gave the value $\alpha$. Similarly, to evaluate a state on Min's turn, we update $\beta$ to be the lowest value achievable from $s$ so far. That is, the value of $s$ is at most $\beta$. Suppose Min chooses action $a$, which leads to state $s'$ for Max. If Max has any actions that do better than $\beta$, they would take it, making action $a$ a suboptimal choice for Min.
def alpha_beta_search(s, player, alpha, beta) -> tuple["Action", "Value"]:
    """Return the best action for the player to move in state s and its value (for Max)."""
    if env.is_terminal(s):
        return None, env.winner(s)

    if player is max:
        a_max, v_max = None, -float("inf")
        for a in env.action_space(s):
            _, v = alpha_beta_search(env.step(s, a), min, alpha, beta)
            if v > v_max:
                a_max, v_max = a, v
                alpha = max(alpha, v)
            if v_max >= beta:
                # we know Min will not choose the action that leads to this state
                return a_max, v_max
        return a_max, v_max
    else:
        a_min, v_min = None, float("inf")
        for a in env.action_space(s):
            _, v = alpha_beta_search(env.step(s, a), max, alpha, beta)
            if v < v_min:
                a_min, v_min = a, v
                beta = min(beta, v)
            if v_min <= alpha:
                # we know Max will not choose the action that leads to this state
                return a_min, v_min
        return a_min, v_min
latex(alpha_beta_search)
How do we choose what order to explore the branches? As you can tell, this significantly affects the efficiency of the pruning algorithm. If Max explores the possible actions in order from worst to best, they will not be able to prune any branches at all! Additionally, to verify that an action is suboptimal, we must run the search recursively from that action, which ultimately requires traversing the tree all the way to a leaf node. The longer the game might possibly last, the more computation we have to run.
In practice, we can often use background information about the game to develop a heuristic for evaluating possible actions; in general, we call a technique a heuristic if it is based on background information or intuition rather than a rigorous justification.
Can we develop heuristic methods for tree exploration that work for all sorts of games?
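For example, one common trick is to sort the candidate moves by a cheap static evaluation before searching them, so that strong moves are explored first and trigger more cutoffs. Here is a minimal sketch, where `heuristic_value` is a hypothetical game-specific function scoring a board position from Max's perspective:

def ordered_actions(s, player):
    """Search the most promising actions first (according to a cheap static
    evaluation) to increase the number of alpha-beta cutoffs."""
    return sorted(
        env.action_space(s),
        key=lambda a: heuristic_value(env.step(s, a)),
        reverse=(player is max),  # Max prefers high scores, Min prefers low ones
    )

Replacing the loop over `env.action_space(s)` in `alpha_beta_search` with a loop over `ordered_actions(s, player)` leaves the result unchanged but can substantially reduce the number of nodes visited.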
5 Monte Carlo Tree Search
The task of evaluating actions in a complex environment might seem familiar. We’ve encountered this problem before in both the multi-armed bandits setting and the Markov decision process setting. Now we’ll see how to combine concepts from these to form a more general and efficient tree search heuristic called Monte Carlo Tree Search (MCTS).
When a problem is intractable to solve exactly, we often turn to approximate algorithms that sacrifice some accuracy in exchange for computational efficiency. MCTS also improves on alpha-beta search in this sense. As the name suggests, MCTS uses Monte Carlo simulation, that is, collecting random samples and computing the sample statistics, in order to approximate the value of each action.
As before, we imagine a complete game tree in which each path represents an entire game. The goal of MCTS is to assign values to only the game states that are relevant to the current game; we gradually expand the tree at each move. For comparison, in alpha-beta search, the entire tree only needs to be solved once, and from then on, choosing an action is as simple as taking a maximum over the previously computed values.
The crux of MCTS is approximating the win probability of a state by a sample probability. In practice, MCTS is used for games with binary outcomes where $r(s_H) \in \{+1, -1\}$, and so this is equivalent to approximating the final game score. To approximate the win probability from state $s$, MCTS samples random games starting in $s$ and computes the sample proportion of those that the player wins.
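That is, writing $M$ for the number of simulated games from $s$ and $M_{\text{win}}$ for the number of those that the player in question wins (both symbols introduced here just for illustration), the estimate is

$$
\hat{p}_{\text{win}}(s) = \frac{M_{\text{win}}}{M}.
$$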
Note that, for a given state $s$, choosing the best action can be framed as a multi-armed bandits problem, where each action corresponds to an arm, and the reward distribution of arm $a$ is the distribution of the game score over random games after choosing that arm. The most commonly used bandit algorithm in practice for MCTS is the Upper Confidence Bound (UCB) algorithm.
This means that, for each edge (corresponding to a state-action pair $(s, a)$) in the game tree, we keep track of the statistics required to compute its UCB:
- How many times it has been "visited" ($N^t_{s, a}$)
- How many of those visits resulted in victory. Let us call this latter value $W^t_{s, a}$ (for the number of "wins").
What does $t$ refer to in the above expressions? Recall that $t$ refers to the number of time steps elapsed in the bandit environment. As mentioned above, each state corresponds to its own bandit environment, and so $t$ refers to $N_s$, that is, how many actions have been taken from state $s$ in total. This term, $N_s$, gets incremented as the algorithm runs; for simplicity, we won't introduce another index to track how it changes.
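With these statistics, a standard concrete form of the UCB selection rule (often called UCT in the tree-search setting) is

$$
a \gets \arg\max_{a \in \mathcal{A}(s)} \left( \frac{W^t_{s, a}}{N^t_{s, a}} + c \sqrt{\frac{\ln N_s}{N^t_{s, a}}} \right),
$$

where $c > 0$ controls the amount of exploration, and unvisited actions (with $N^t_{s, a} = 0$) are conventionally tried first.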
The application which brought the MCTS algorithm to fame was DeepMind’s AlphaGo Silver et al. (2016). Since then, it has been used in numerous applications ranging from games to automated theorem proving.
How accurate is this Monte Carlo estimation? It depends heavily on the rollout policy $\pi_{\text{rollout}}$ used to sample the games. If the distribution $\pi_{\text{rollout}}$ induces over games is very different from the distribution seen during real gameplay, we might end up with a poor value approximation.
5.1 Incorporating value functions and policies
To remedy this, we might make use of a value function $v : \mathcal{S} \to \mathbb{R}$ that more efficiently approximates the value of a state. Then, we can replace the simulation step of MCTS with evaluating $v(s')$, where $s' = P(s, a)$ is the newly expanded state.
We might also make use of a "guiding" policy $\pi_{\text{guide}} : \mathcal{S} \to \triangle(\mathcal{A})$ that provides "intuition" as to which actions are more valuable in a given state. We can scale the exploration term of the UCB selection rule above according to the policy's outputs.
Putting these together, we can describe an updated version of MCTS that makes use of this value function and guiding policy:
class EdgeStatistics(NamedTuple):
    wins: int = 0
    visits: int = 0


class MCTSTree:
    """A representation of the search tree.

    Maps each state-action pair to its number of wins and the number of visits.
    """

    edges: dict[tuple["State", "Action"], EdgeStatistics]


def mcts_iter(tree, s_init):
    s = s_init
    # Selection: descend while every action from s has already been expanded.
    while all((s, a) in tree.edges for a in env.action_space(s)):
        ...
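The cell above only sets up the data structures and the start of the selection loop. Below is a minimal sketch of one full iteration, combining a PUCT-style selection rule (the exploration bonus scaled by the guiding policy, as in AlphaZero-style methods) with evaluation by the value function. It is a sketch rather than a reference implementation: it assumes states are hashable, that `env` exposes the same abstract interface used by `minimax_search` (`env.action_space`, `env.step`, `env.is_terminal`, `env.winner`), and that `pi(s, a)` and `v(s)` are stand-ins for the guiding policy's probability of action `a` in state `s` and the learned value estimate (for Max).

import math
import random


def other(player):
    """Switch between the `max`/`min` player sentinels used by `minimax_search`."""
    return min if player is max else max


def guided_ucb_score(tree, s, a, c=2.0):
    """Mean value of edge (s, a) for the player to move, plus an exploration
    bonus scaled by the guiding policy's probability pi(s, a)."""
    edge = tree.edges[(s, a)]
    n_s = sum(tree.edges[(s, b)].visits for b in env.action_space(s) if (s, b) in tree.edges)
    mean = edge.wins / edge.visits if edge.visits > 0 else 0.0
    return mean + c * pi(s, a) * math.sqrt(n_s) / (1 + edge.visits)


def mcts_iteration(tree, s_init, player):
    """One MCTS iteration: select, expand, evaluate with v, then back up."""
    s, path = s_init, []
    # Selection: descend while every action from s is already in the tree,
    # always taking the action with the highest guided UCB score.
    while not env.is_terminal(s) and all(
        (s, a) in tree.edges for a in env.action_space(s)
    ):
        a = max(env.action_space(s), key=lambda a: guided_ucb_score(tree, s, a))
        path.append((s, a, player))
        s, player = env.step(s, a), other(player)

    if env.is_terminal(s):
        value = env.winner(s)  # exact game score (for Max) of a terminal state
    else:
        # Expansion: add one unvisited edge, then evaluate the resulting state
        # with the value function v instead of simulating a random game.
        a = random.choice([b for b in env.action_space(s) if (s, b) not in tree.edges])
        tree.edges[(s, a)] = EdgeStatistics()
        path.append((s, a, player))
        value = v(env.step(s, a))  # estimated game score (for Max)

    # Backup: credit every edge on the path from the perspective of the player
    # who chose it (Max prefers high values, Min prefers low values).
    for s_i, a_i, mover in path:
        edge = tree.edges[(s_i, a_i)]
        gain = value if mover is max else -value
        tree.edges[(s_i, a_i)] = EdgeStatistics(edge.wins + gain, edge.visits + 1)

Running many such iterations from the current game state and then choosing the action with the most visits (or the highest mean value) yields the move actually played.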
How do we actually compute a useful $\pi$ and $v$? If we have some existing dataset of trajectories, we could use supervised learning (that is, imitation learning) to generate a policy $\pi$ via behavioral cloning and learn $v$ by regressing the game outcomes onto states. Then, plugging these into the above algorithm results in a stronger policy by using tree search to "think ahead".
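Concretely, given a dataset of states $s_i$, the actions $a_i$ taken in them, and the final outcomes $r_i$ of the corresponding games, one natural pair of training objectives (a sketch; the exact losses vary by implementation) is

$$
\pi = \arg\max_{\pi_\theta} \sum_i \log \pi_\theta(a_i \mid s_i),
\qquad
v = \arg\min_{v_\phi} \sum_i \big( v_\phi(s_i) - r_i \big)^2,
$$

that is, behavioral cloning by maximum likelihood for the policy, and least-squares regression of game outcomes onto states for the value function.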
But we don’t have to stop at just one improvement step; we could iterate this process via self-play.
5.2 Self-play
Recall the policy iteration algorithm from the MDPs chapter. Policy iteration alternates between policy evaluation (taking $\pi$ and computing $V^\pi$) and policy improvement (setting $\pi$ to be greedy with respect to $V^\pi$). Above, we saw how MCTS can be thought of as a "policy improvement" operation: for a given policy $\pi^0$, we can use it to guide MCTS, resulting in an algorithm that is itself a policy $\pi^0_{\text{MCTS}}$ that maps from states to actions. Now, we can use behavioral cloning to obtain a new policy $\pi^1$ that imitates $\pi^0_{\text{MCTS}}$. We can now use $\pi^1$ to guide MCTS, and repeat.
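Schematically, one way to write this loop is the following sketch, where `mcts_policy`, `play_games`, `behavioral_cloning`, and `fit_value_function` are hypothetical helpers standing in for the steps described above:

def self_play_loop(pi, v, mcts_policy, play_games, behavioral_cloning, fit_value_function,
                   num_iterations=10, num_games=100):
    """Iterate MCTS-based policy improvement and imitation (schematic)."""
    for _ in range(num_iterations):
        improved = mcts_policy(pi, v)              # policy improvement via tree search
        games = play_games(improved, num_games)    # self-play data from the improved policy
        pi = behavioral_cloning(games)             # distill the improved policy
        v = fit_value_function(games)              # regress game outcomes onto states
    return pi, v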
This algorithm was brought to fame by AlphaGo Zero Silver et al. (2017).
6 Summary
In this chapter, we explored tree search-based algorithms for deterministic, zero-sum, fully observable two-player games. We began with min-max search, an algorithm for exactly computing the game value of every possible state. However, this is computationally intractable in practice, and so we must resort to various ways to reduce the number of states and actions that we must explore. Alpha-beta search does this by pruning away states that we already know to be suboptimal, and Monte Carlo Tree Search approximates the value of states instead of evaluating them exactly.
7 References
Chapter 5 of Russell & Norvig (2021) provides an excellent overview of search methods in games. The original AlphaGo paper Silver et al. (2016) was a groundbreaking application of these technologies. Silver et al. (2017) removed the imitation learning phase, learning from scratch. AlphaZero Silver et al. (2018) then extended to other games beyond Go, namely shogi and chess, also learning from scratch. In MuZero Schrittwieser et al. (2020), this was further extended by learning a model of the game dynamics.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587), 484–489. 10.1038/nature16961
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017). Mastering the Game of Go without Human Knowledge. Nature, 550(7676), 354–359. 10.1038/nature24270
- Russell, S. J., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (Fourth edition). Pearson.
- Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go through Self-Play. Science, 362(6419), 1140–1144. 10.1126/science.aar6404
- Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588(7839), 604–609. 10.1038/s41586-020-03051-4