Description:
- Instead of casting out the tree, we try full assignment of variables and iterate it
- Combines 2 ideas:
- Evaluation by rollouts:
- play multiple games to termination from a state s (using simple, fast or random policy) and count nb of wins/ nb of games explored
- as law of large number, the fraction of wins correlates the true value of the position
- Selective search: explore parts of the tree that help improve the decision at the root, regardless of depth
- Should allocate rollouts to more promising (more win rate) and uncertain nodes (less rollouts,ie less nodes explored)
- so we defined new heuristic, Upper Confidence Bound (UCB1), that combines promising and uncertain idea
- UCB1(n)=N(n)U(n)+C×N(n)log(N(parent(n))) where:
- N(n) is the number of rollouts of node n
- U(n) is the total utility (wins) for Player at parent of node n
Algorithm:
- Using the ideas above
- Repeat until out of time:
- Selection: recursively apply UCB to choose a path down to a leaf node n
- child → grandchild → … → leaf node
- Expansion: add a new child c to n
- Simulation: run a rollout from c
- Backpropagation: update U and N function counts from c back up to the root
- Choose the action leading to the child with highest N
- The value of a node, U(n)/N(n), is a weighted sum of child values
- As n→∞, the vast majority of rollouts are concentrated in the best child(ren), so the weighted average tends to max/min
- No optimality is guaranteed