
8 Tree Search Methods

1 Introduction

Have you ever lost a strategy game against a skilled opponent? It probably seemed like they were ahead of you at every turn. They might have been planning ahead and anticipating your actions, then formulating their strategy to counter yours. If this opponent was a computer, they might have been using one of the strategies that we are about to explore.

%load_ext autoreload
%autoreload 2
from utils import Int, Array, latex, jnp
from enum import IntEnum
from typing import NamedTuple

# Assumed here: the environment below follows the Gymnasium API.
import gymnasium as gym
from gymnasium import spaces

2 Deterministic, zero sum, fully observable two-player games

In this chapter, we will focus on games that are:

  • deterministic,
  • zero sum (one player wins and the other loses),
  • fully observable, that is, the state of the game is perfectly known by both players,
  • played by two players who alternate turns.

We can represent such a game as a complete game tree. Each possible state is a node in the tree, and since we only consider deterministic games, we can represent actions as edges leading from the current state to the next. Each path through the tree, from root to leaf, represents a single game.

The first two layers of the complete game tree of tic-tac-toe. From Wikimedia.

If you could store the complete game tree on a computer, you would be able to win every potentially winnable game by searching all paths from your current state and taking a winning move. We will see an explicit algorithm for this in the next section. However, as games become more complex, it becomes computationally impossible to search every possible path.

For instance, a chess player has roughly 30 actions to choose from at each turn, and each game takes roughly 40 moves per player, so trying to solve chess exactly using minimax would take somewhere on the order of $30^{80} \approx 10^{118}$ operations. That’s 10 billion billion billion billion billion billion billion billion billion billion billion billion billion operations. As of the time of writing, the fastest processor can achieve almost 10 GHz (10 billion operations per second), so to fully solve chess using minimax is many, many orders of magnitude out of reach.
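As a quick sanity check on this arithmetic (a minimal sketch, using only the rough figures quoted above):

import math

# Rough figures from the text: ~30 legal moves per turn, ~80 moves per game in total.
branching_factor = 30
total_moves = 80

# log10(30^80) = 80 * log10(30) ≈ 118.2, so 30^80 ≈ 10^118.
exponent = total_moves * math.log10(branching_factor)
print(f"30^80 ≈ 10^{exponent:.1f}")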

It is thus intractable, in any realistic setting, to solve the complete game tree exactly. Luckily, only a small fraction of those games ever occur in reality. Later in this chapter, we will explore ways to prune away parts of the tree that we know we can safely ignore. We can also approximate the value of a state without fully evaluating it. Using these approximations, we can no longer guarantee winning the game, but we can come up with strategies that will do well against most opponents.

2.1 Notation

Let us now describe these games formally. We’ll call the first player Max and the second player Min. Max seeks to maximize the final game score, while Min seeks to minimize the final game score.

  • We’ll use $\mathcal{S}$ to denote the set of all possible game states.
  • The game begins in some initial state $s_0 \in \mathcal{S}$.
  • Max moves on even turn numbers $h = 2n$, and Min moves on odd turn numbers $h = 2n + 1$, where $n$ is a natural number.
  • The space of possible actions, $\mathcal{A}_h(s)$, depends on the state itself, as well as whose turn it is. (For example, in tic-tac-toe, Max can only play Xs while Min can only play Os.)
  • The game ends after $H$ total moves (which might be even or odd). We call the final state a terminal state.
  • $P$ denotes the state transitions, that is, $P(s, a)$ denotes the resulting state when taking action $a \in \mathcal{A}(s)$ in state $s$. We’ll assume that this function is time-homogeneous (a.k.a. stationary) and doesn’t change across timesteps.
  • $r(s)$ denotes the game score of the terminal state $s$. Note that this is some positive or negative value seen by both players: a positive value indicates Max winning, a negative value indicates Min winning, and a value of 0 indicates a tie.

We also call the sequence of states and actions a trajectory.

Our notation may remind you of Markov decision processes. Given that these games also involve a sequence of states and actions, can we formulate them as finite-horizon MDPs? The two settings are not exactly analogous, since in MDPs we only consider a single policy, while these games involve two distinct players with opposite objectives. Since we want to analyze the behavior of both players at the same time, describing such a game as an MDP is more trouble than it’s worth.

class Player(IntEnum):
    EMPTY = 0
    X = 1
    O = 2


class TicTacToeEnv(gym.Env):
    metadata = {"render.modes": ["human"]}

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(9)
        self.observation_space = spaces.Box(
            low=0, high=2, shape=(3, 3), dtype=jnp.int32
        )
        self.board = None
        self.current_player = None
        self.done = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.board = jnp.zeros((3, 3), dtype=jnp.int32)
        self.current_player = Player.X
        self.done = False
        return self.board, {}

    def step(self, action: jnp.int32):
        """Take the given action and return (board, reward, terminated, truncated, info)."""
        if self.done:
            raise ValueError("The game is already over. Call `env.reset()` to reset the environment.")

        row, col = divmod(action, 3)
        if self.board[row, col] != Player.EMPTY:
            # Illegal move: penalize and leave the board unchanged.
            return self.board, -10.0, False, False, {}

        self.board = self.board.at[row, col].set(self.current_player)
        self.done = bool(self.is_terminal(self.board))
        # Game score from Max's (X's) perspective: +1 if X wins, -1 if O wins, 0 otherwise.
        reward = 1.0 if self.is_winner(self.board, Player.X) else -1.0 if self.is_winner(self.board, Player.O) else 0.0
        self.current_player = Player.O if self.current_player == Player.X else Player.X
        return self.board, reward, self.done, False, {}

    @staticmethod
    def is_terminal(s: Int[Array, "3 3"]):
        """Check if the game is over (someone has won or the board is full)."""
        return (TicTacToeEnv.is_winner(s, Player.X) or TicTacToeEnv.is_winner(s, Player.O)
                or jnp.all(s != Player.EMPTY))
    
    @staticmethod
    def is_winner(board: Int[Array, "3 3"], player: Player):
        """Check if the given player has won."""
        return any(
            jnp.all(board[i, :] == player) or
            jnp.all(board[:, i] == player)
            for i in range(3)
        ) or jnp.all(jnp.diag(board) == player) or jnp.all(jnp.diag(jnp.fliplr(board)) == player)
    
    @staticmethod
    def show(s: Int[Array, "3 3"]):
        """Print the board."""
        for row in range(3):
            print(" | ".join(" XO"[s[row, col]] for col in range(3)))
            if row < 2:
                print("-" * 5)

In the introduction, we claimed that we could win any potentially winnable game by looking ahead and predicting the opponent’s actions. This would mean that each nonterminal state already has some predetermined game score, that is, in each state, it is already “obvious” which player is going to win.

Let $V^\star_\hi(s)$ denote the game score under optimal play from both players starting in state $s$ at time $\hi$.

We can compute this by starting at the terminal states, when the game’s outcome is known, and working backwards, assuming that Max chooses the action that leads to the highest score and Min chooses the action that leads to the lowest score.
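One way to write out this backward induction, consistent with the notation above:

$$
V^\star_\hi(s) =
\begin{cases}
r(s) & s \text{ is terminal} \\
\max_{a \in \mathcal{A}_\hi(s)} V^\star_{\hi + 1}(P(s, a)) & \hi \text{ even (Max's turn)} \\
\min_{a \in \mathcal{A}_\hi(s)} V^\star_{\hi + 1}(P(s, a)) & \hi \text{ odd (Min's turn)}
\end{cases}
$$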

This translates directly into a recursive depth-first search algorithm for searching the complete game tree.

def minimax_search(s, player) -> tuple["Action", "Value"]:
    """Return the optimal action for the current player and the value of the state (for Max).

    The built-in functions `max` and `min` are used as tags for the two players.
    """
    if env.is_terminal(s):
        return None, env.winner(s)

    if player is max:
        a_max, v_max = None, -float("inf")
        for a in env.action_space(s):
            _, v = minimax_search(env.step(s, a), min)
            if v > v_max:
                a_max, v_max = a, v
        return a_max, v_max
    else:
        a_min, v_min = None, float("inf")
        for a in env.action_space(s):
            _, v = minimax_search(env.step(s, a), max)
            if v < v_min:
                a_min, v_min = a, v
        return a_min, v_min

latex(minimax_search, custom_identifiers={"env.step": "P", "env.action_space": r"\mathcal{A}"})

At each of the $\hor$ timesteps, this algorithm iterates through the entire action space at that state, and therefore has a time complexity of $n_A^\hor$ (where $n_A$ is the largest number of actions possibly available at once). This makes the min-max algorithm impractical for even moderately sized games.

But do we need to compute the exact value of every possible state? Instead, is there some way we could “ignore” certain actions and their subtrees if we already know of better options? The alpha-beta search makes use of this intuition.

The intuition behind alpha-beta search is as follows: Suppose Max is in state $s$, and considering whether to take action $a$ or $a'$. If at any point they find out that action $a'$ is definitely worse than (or equal to) action $a$, they don’t need to evaluate action $a'$ any further.

Concretely, we run min-max search as above, except now we keep track of two additional parameters $\alpha(s)$ and $\beta(s)$ while evaluating each state:

  • Starting in state $s$, Max can achieve a game score of at least $\alpha(s)$ assuming Min plays optimally. That is, $V^\star_\hi(s) \ge \alpha(s)$ at all points.
  • Analogously, starting in state $s$, Min can ensure a game score of at most $\beta(s)$ assuming Max plays optimally. That is, $V^\star_\hi(s) \le \beta(s)$ at all points.

Suppose we are evaluating $V^\star_\hi(s)$, where it is Max’s turn ($\hi$ is even). We update $\alpha(s)$ to be the highest minimax value achievable from $s$ so far. That is, the value of $s$ is at least $\alpha(s)$. Suppose Max chooses action $a$, which leads to state $s'$, in which it is Min’s turn. If any of Min’s actions in $s'$ achieve a value $V^\star_{\hi+1}(s') \le \alpha(s)$, we know that Max would not choose action $a$, since they know that it is worse than whichever action gave the value $\alpha(s)$. Similarly, to evaluate a state on Min’s turn, we update $\beta(s)$ to be the lowest value achievable from $s$ so far. That is, the value of $s$ is at most $\beta(s)$. Suppose Min chooses action $a$, which leads to state $s'$ for Max. If Max has any actions that do better than $\beta(s)$, they would take it, making action $a$ a suboptimal choice for Min.

def alpha_beta_search(s, player, alpha, beta) -> tuple["Action", "Value"]:
    """Return the optimal action for the current player and the value of the state (for Max).

    Called initially as `alpha_beta_search(s, max, -float("inf"), float("inf"))`.
    """
    if env.is_terminal(s):
        return None, env.winner(s)

    if player is max:
        a_max, v_max = None, -float("inf")
        for a in env.action_space(s):
            _, v = alpha_beta_search(env.step(s, a), min, alpha, beta)
            if v > v_max:
                a_max, v_max = a, v
                alpha = max(alpha, v)
            if v_max >= beta:
                # we know Min will not choose the action that leads to this state
                return a_max, v_max
        return a_max, v_max

    else:
        a_min, v_min = None, float("inf")
        for a in env.action_space(s):
            _, v = alpha_beta_search(env.step(s, a), max, alpha, beta)
            if v < v_min:
                a_min, v_min = a, v
                beta = min(beta, v)
            if v_min <= alpha:
                # we know Max will not choose the action that leads to this state
                return a_min, v_min
        return a_min, v_min


latex(alpha_beta_search)

How do we choose what order to explore the branches? As you can tell, this significantly affects the efficiency of the pruning algorithm. If Max explores the possible actions in order from worst to best, they will not be able to prune any branches at all! Additionally, to verify that an action is suboptimal, we must run the search recursively from that action, which ultimately requires traversing the tree all the way to a leaf node. The longer the game might possibly last, the more computation we have to run.

In practice, we can often use background information about the game to develop a heuristic for evaluating possible actions. If a technique is based on background information or intuition, especially if it isn’t rigorously justified, we call it a heuristic.
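For example (a sketch only; `heuristic_value` is a hypothetical, game-specific scoring function, not something defined above), we might sort the actions by a cheap heuristic estimate before the loop in alpha-beta search, so that the most promising moves are explored first and the $\alpha$ and $\beta$ bounds tighten quickly:

def order_actions(s, actions, heuristic_value, maximizing: bool):
    """Sort actions so the heuristically best moves for the current player come first.

    `heuristic_value(s, a)` is a hypothetical, domain-specific estimate of how
    good action `a` looks in state `s` from Max's perspective.
    """
    return sorted(actions, key=lambda a: heuristic_value(s, a), reverse=maximizing)

The better this ordering, the earlier the pruning conditions in alpha-beta search are triggered.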

Can we develop heuristic methods for tree exploration that work for all sorts of games?

The task of evaluating actions in a complex environment might seem familiar. We’ve encountered this problem before in both the multi-armed bandits setting and the Markov decision process setting. Now we’ll see how to combine concepts from these to form a more general and efficient tree search heuristic called Monte Carlo Tree Search (MCTS).

When a problem is intractable to solve exactly, we often turn to approximate algorithms that sacrifice some accuracy in exchange for computational efficiency. MCTS also improves on alpha-beta search in this sense. As the name suggests, MCTS uses Monte Carlo simulation, that is, collecting random samples and computing the sample statistics, in order to approximate the value of each action.

As before, we imagine a complete game tree in which each path represents an entire game. The goal of MCTS is to assign values to only the game states that are relevant to the current game; We gradually expand the tree at each move. For comparison, in alpha-beta search, the entire tree only needs to be solved once, and from then on, choosing an action is as simple as taking a maximum over the previously computed values.

The crux of MCTS is approximating the win probability of a state by a sample probability. In practice, MCTS is used for games with binary outcomes where $r(s) \in \{+1, -1\}$, and so this is equivalent to approximating the final game score. To approximate the win probability from state $s$, MCTS samples random games starting in $s$ and computes the sample proportion of those that the player wins.
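A minimal sketch of this sampling procedure, assuming the same generic interface (`env.action_space`, `env.step`, `env.is_terminal`, `env.winner`) used in the search pseudocode above and a uniformly random rollout policy:

import random

def rollout_value(s, num_rollouts: int = 100) -> float:
    """Estimate the value of state s by averaging the outcomes of random games."""
    total = 0.0
    for _ in range(num_rollouts):
        state = s
        # Play uniformly random moves until the game ends.
        while not env.is_terminal(state):
            a = random.choice(list(env.action_space(state)))
            state = env.step(state, a)
        total += env.winner(state)  # +1 if Max wins, -1 if Min wins
    return total / num_rollouts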

Note that, for a given state $s$, choosing the best action $a$ can be framed as a multi-armed bandits problem, where each action corresponds to an arm, and the reward distribution of arm $k$ is the distribution of the game score over random games after choosing that arm. The most commonly used bandit algorithm in practice for MCTS is the Upper Confidence Bound (UCB) algorithm.

This means that, for each edge (corresponding to a state-action pair $(s, a)$) in the game tree, we keep track of the statistics required to compute its UCB:

  • How many times it has been “visited” ($N^{s, a}_t$)
  • How many of those visits resulted in victory ($\sum_{\tau=0}^{t-1} \ind{(s_\tau, a_\tau) = (s, a)} r_\tau$). Let us call this latter value $W^{s, a}_t$ (for number of “wins”).

What does $t$ refer to in the above expressions? Recall $t$ refers to the number of time steps elapsed in the bandit environment. As mentioned above, each state $s$ corresponds to its own bandit environment, and so $t$ refers to $N^s$, that is, how many actions have been taken from state $s$. This term, $N^s$, gets incremented as the algorithm runs; for simplicity, we won’t introduce another index to track how it changes.
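Putting these statistics together, the UCB score for an edge $(s, a)$ might be computed as follows (a sketch; the exploration coefficient `c` and the exact form of the bonus vary between implementations):

import math

def ucb_score(W_sa: float, N_sa: int, N_s: int, c: float = math.sqrt(2)) -> float:
    """Upper confidence bound for the edge (s, a).

    W_sa: total score accumulated through this edge ("wins"),
    N_sa: number of visits to this edge,
    N_s:  total number of actions taken from state s.
    """
    if N_sa == 0:
        return float("inf")  # always try unvisited actions first
    return W_sa / N_sa + c * math.sqrt(math.log(N_s) / N_sa)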

The application which brought the MCTS algorithm to fame was DeepMind’s AlphaGo Silver et al. (2016). Since then, it has been used in numerous applications ranging from games to automated theorem proving.

How accurate is this Monte Carlo estimation? It depends heavily on the rollout policy $\pi_\text{rollout}$. If the distribution that $\pi_\text{rollout}$ induces over games is very different from the distribution seen during real gameplay, we might end up with a poor value approximation.

5.1 Incorporating value functions and policies

To remedy this, we might make use of a value function $v : \mathcal{S} \to \mathbb{R}$ that more efficiently approximates the value of a state. Then, we can replace the simulation step of MCTS with evaluating $r = v(s_\text{next})$, where $s_\text{next} = P(s_\text{new}, a_\text{new})$.

We might also make use of a “guiding” policy $\pi_\text{guide} : \mathcal{S} \to \triangle(\mathcal{A})$ that provides “intuition” as to which actions are more valuable in a given state. We can scale the exploration term of (4) according to the policy’s outputs.
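For instance, a PUCT-style selection rule in the spirit of the AlphaGo family (a sketch; the constant `c` and the precise form differ across implementations) scales the exploration bonus for each action by $\pi_\text{guide}(a \mid s)$:

import math

def puct_score(W_sa: float, N_sa: int, N_s: int, prior: float, c: float = 1.0) -> float:
    """Policy-weighted exploration bonus: prior = pi_guide(a | s)."""
    q = W_sa / N_sa if N_sa > 0 else 0.0
    return q + c * prior * math.sqrt(N_s) / (1 + N_sa)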

Putting these together, we can describe an updated version of MCTS that makes use of these value functions and policy:

class EdgeStatistics(NamedTuple):
    wins: int = 0
    visits: int = 0

class MCTSTree:
    """A representation of the search tree.

    Maps each state-action pair to its number of wins and the number of visits.
    """

    def __init__(self):
        self.edges: dict[tuple["State", "Action"], EdgeStatistics] = {}

def mcts_iter(tree, s_init):
    s = s_init
    # Selection: descend while every action from s already has an edge in the tree.
    while all((s, a) in tree.edges for a in env.action_space(s)):
        ...  # see the full sketch of one iteration below
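One possible completion of a single MCTS iteration is sketched below. It reuses the hypothetical `rollout_value` and `ucb_score` helpers from earlier (or $v$ and the policy-weighted score in the variant just described), and for brevity it scores every edge from Max’s perspective, ignoring the sign flip that a careful implementation would apply on Min’s turns:

def mcts_iteration(tree: MCTSTree, s_init):
    """One select -> expand -> simulate -> backup pass (sketch)."""
    # Selection: walk down the tree, choosing actions by their UCB scores,
    # until we reach a state with an unexpanded action (or a terminal state).
    path, s = [], s_init
    while not env.is_terminal(s) and all(
        (s, a) in tree.edges for a in env.action_space(s)
    ):
        N_s = sum(tree.edges[(s, a)].visits for a in env.action_space(s))
        a = max(env.action_space(s), key=lambda a: ucb_score(
            tree.edges[(s, a)].wins, tree.edges[(s, a)].visits, N_s))
        path.append((s, a))
        s = env.step(s, a)

    # Expansion: add one previously unexplored edge to the tree.
    if not env.is_terminal(s):
        a = next(a for a in env.action_space(s) if (s, a) not in tree.edges)
        tree.edges[(s, a)] = EdgeStatistics()
        path.append((s, a))
        s = env.step(s, a)

    # Simulation: estimate the value of the new state, e.g. by a random rollout.
    value = env.winner(s) if env.is_terminal(s) else rollout_value(s)

    # Backup: update the statistics of every edge visited along the way.
    for (s, a) in path:
        stats = tree.edges[(s, a)]
        tree.edges[(s, a)] = EdgeStatistics(stats.wins + value, stats.visits + 1)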

How do we actually compute a useful πguide\pi_\text{guide} and vv? If we have some existing dataset of trajectories, we could use supervised learning (that is, imitation learning) to generate a policy πguide\pi_\text{guide} via behavioral cloning and learn vv by regressing the game outcomes onto states. Then, plugging these into the above algorithm results in a stronger policy by using tree search to “think ahead”.

But we don’t have to stop at just one improvement step; we could iterate this process via self-play.

5.2 Self-play

Recall the policy iteration algorithm from the MDPs chapter. Policy iteration alternates between policy evaluation (taking $\pi$ and computing $V^\pi$) and policy improvement (setting $\pi$ to be greedy with respect to $V^\pi$). Above, we saw how MCTS can be thought of as a “policy improvement” operation: for a given policy $\pi^0$, we can use it to guide MCTS, resulting in an algorithm that is itself a policy $\pi^0_\text{MCTS}$ that maps from states to actions. Now, we can use behavioral cloning to obtain a new policy $\pi^1$ that imitates $\pi^0_\text{MCTS}$. We can now use $\pi^1$ to guide MCTS, and repeat.
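Schematically, one might implement this loop as follows (a sketch only; `run_mcts_policy`, `collect_self_play_games`, `behavioral_cloning`, and `fit_value_function` are hypothetical helpers standing in for the components described above):

def self_play_improvement(pi, v, num_rounds: int):
    """Alternate MCTS-based policy improvement with imitation of the improved policy."""
    for _ in range(num_rounds):
        # Policy improvement: use pi (and v) to guide MCTS, giving a stronger policy.
        pi_mcts = run_mcts_policy(guide=pi, value=v)

        # Generate data by having the improved policy play against itself.
        games = collect_self_play_games(pi_mcts)

        # Imitation: fit a new policy to the MCTS action choices, and a new value
        # function to the final game outcomes.
        pi = behavioral_cloning(games)
        v = fit_value_function(games)
    return pi, v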

This algorithm was brought to fame by AlphaGo Zero Silver et al. (2017).

6 Summary

In this chapter, we explored tree search-based algorithms for deterministic, zero sum, fully observable two-player games. We began with min-max search, an algorithm for exactly solving the game value of every possible state. However, this is impossible to execute in practice, and so we must resort to various ways to reduce the number of states and actions that we must explore. Alpha-beta search does this by pruning away states that we already know to be suboptimal, and Monte Carlo Tree Search approximates the value of states instead of evaluating them exactly.

7 References

Chapter 5 of Russell & Norvig (2021) provides an excellent overview of search methods in games. The original AlphaGo paper Silver et al. (2016) was a groundbreaking application of these technologies. Silver et al. (2017) removed the imitation learning phase, learning from scratch. AlphaZero Silver et al. (2018) then extended to other games beyond Go, namely shogi and chess, also learning from scratch. In MuZero Schrittwieser et al. (2020), this was further extended by learning a model of the game dynamics.

References
  1. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587), 484–489. 10.1038/nature16961
  2. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., & Hassabis, D. (2017). Mastering the Game of Go without Human Knowledge. Nature, 550(7676), 354–359. 10.1038/nature24270
  3. Russell, S. J., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (Fourth edition). Pearson.
  4. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go through Self-Play. Science, 362(6419), 1140–1144. 10.1126/science.aar6404
  5. Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588(7839), 604–609. 10.1038/s41586-020-03051-4