Description:
- An MDP is defined with:
- A set of states s∈S
- A set of actions a∈A
- A Transition Function T(s,a,s’)
- Probability that a from s leads to s’, i.e., P(s’∣s,a)
- Also called the model or the dynamics
- A reward function R(s,a,s’)
- Sometimes just R(s) or R(s’)
- A start state
- Maybe a terminal state
- Usually denoted as the tuple 〈S, A, T, R, 𝛄〉 (a toy Python encoding of this tuple is sketched after this list)
- These methods only work if we know the full reward function and transition model
- Better suited than alpha-beta/expectimax search here, since solving the MDP directly avoids repeatedly re-expanding the same states
- So you want to:
- Compute optimal values: use value iteration or policy iteration
- Compute values for a particular policy: use policy evaluation
- Turn your values into a policy: use policy extraction (one-step lookahead)
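To make the tuple concrete, here is one possible encoding of a tiny MDP in Python. This is a minimal sketch: the state names, actions, probabilities, and rewards are invented purely for illustration.

```python
# A toy encoding of an MDP tuple <S, A, T, R, gamma>; every name and number
# here is made up for illustration only.

GAMMA = 0.9                                       # discount factor

STATES = ["cool", "warm", "overheated"]           # S
ACTIONS = ["slow", "fast"]                        # A

# T and R folded together: T[(s, a)] is a list of (s_next, prob, reward)
# triples, i.e. P(s' | s, a) paired with R(s, a, s').
T = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("overheated", 1.0, -10.0)],
    # "overheated" is terminal: no outgoing actions.
}

START_STATE = "cool"
```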
Policies:
- For MDPs, we want an optimal policy 𝛑*: S → A
- A policy π gives an action for each state; fixing a policy turns the MDP into a Markov chain
- An optimal policy is one that maximizes expected utility if followed
- An explicit policy defines a reflex agent
- Expectimax doesn’t compute entire policies, it computes the action for a single state only
- since the expectimax tree can be infinite and its states need not recur, it cannot be represented as a Markov chain
Optimal quantities:
- V*(s) = max_a Q*(s,a)
- utility (value) of state s, meaning starting in s and acting optimally
- i.e., the max over actions a of the expected utility Q*(s,a)
- Q*(s,a) = ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)]
- expected utility starting out having taken action a from state s and (thereafter) acting optimally
- Q* = transition probability × (reward for reaching s′ from s + γ × value of s′), summed over all s′
- Note that the transition function is a probability
- Then V*(s) = max_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)]
- π∗(s): optimal action from state s
- Solving these recursive equations:
- Time-limited value:
- Define Vk(s) to be the optimal value of s if the game ends in k more steps
- compute the values of the leaf nodes at depth k, then use them to find the values of their parents, and so on up the tree
- equivalent to limiting the expectimax depth to k (a minimal code sketch of one such backup follows this section)
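The V*/Q* recursion translates directly into a one-step backup. Below is a minimal sketch, assuming transitions are stored as a dict mapping (state, action) to (next_state, probability, reward) triples; the helper names q_value and backup are made up for illustration.

```python
# A single Bellman backup, assuming `transitions[(s, a)]` is a list of
# (s_next, prob, reward) triples and `V` maps states to current value
# estimates (e.g., the time-limited values V_k).

def q_value(transitions, V, s, a, gamma=0.9):
    """Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for (s2, p, r) in transitions[(s, a)])

def backup(transitions, V, s, actions, gamma=0.9):
    """V_{k+1}(s) = max_a Q(s, a); terminal states (no actions) get 0."""
    qs = [q_value(transitions, V, s, a, gamma)
          for a in actions if (s, a) in transitions]
    return max(qs) if qs else 0.0
```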
Value iteration:
- Init: ∀s: V(s)=0
- Start with V_0(s) = 0: with no time steps left, no utility can be earned from any state
- except terminal states
- Iterate:
- ∀s: V_new(s) = max_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V(s′)]
- V = V_new
- Use V*(s) = max_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)] to compute each state's value from the values of its successor states
- Repeat until convergence; the values converge to V* (a code sketch of the full loop, plus policy extraction, follows this section)
- Complexity of each iteration: O(S²A)
- Each iteration computes V_{k+1} for all S states; for each state we consider every action, and for each action we sum over up to S successor states
- Theorem: the values will converge to the unique optimal values
- Policy extraction:
- After we have the optimal value V∗(s), we need to know which action to take
- π*(s) = argmax_a Q*(s,a) = argmax_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)]
- i.e., do a one-step (mini) expectimax
- pick the action that maximizes expected utility
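A minimal sketch of the full value-iteration loop plus policy extraction, assuming the same (next_state, probability, reward) transition-table layout as the earlier sketches; the function names and the convergence tolerance are illustrative choices.

```python
# Value iteration (V_0 = 0, repeat Bellman backups until the values stop
# changing) followed by one-step-lookahead policy extraction.
# Assumes T[(s, a)] is a list of (s_next, prob, reward) triples.

def value_iteration(states, actions, T, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                     # V_0(s) = 0 for all s
    while True:
        V_new = {}
        for s in states:
            qs = [sum(p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)])
                  for a in actions if (s, a) in T]
            V_new[s] = max(qs) if qs else 0.0        # terminal states stay at 0
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                             # (approximately) V*
        V = V_new

def extract_policy(states, actions, T, V, gamma=0.9):
    """pi(s) = argmax_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V(s')]."""
    policy = {}
    for s in states:
        avail = [a for a in actions if (s, a) in T]
        if not avail:                                # terminal state: no action
            continue
        policy[s] = max(avail, key=lambda a: sum(
            p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)]))
    return policy
```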
Policy-centric methods:
- Policy Evaluation
- Fixed policies:
- The policy itself is held fixed; we only compute its values
- The tree would be simpler – only one action to take (according to the fixed policy) per state
- … though the tree’s value would depend on which policy we fixed
- Define the utility of a state s under a fixed policy π: V^π(s) = expected total discounted reward starting in s and following π
- V^π(s) = ∑_{s′} T(s,π(s),s′) [R(s,π(s),s′) + γ V^π(s′)]
- O(S²) per iteration
- Idea 1: turn the recursive Bellman equations into iterative updates
- V^π_0(s) = 0
- V^π_{k+1}(s) = ∑_{s′} T(s,π(s),s′) [R(s,π(s),s′) + γ V^π_k(s′)]
- (done as naive recursion without dynamic programming, the cost would be exponential)
- Idea 2: without the maxes, the Bellman equations become a linear system
- i.e., a matrix equation in the vector of values (V^π(s₁), …, V^π(s_n))
- much simpler: solving the linear system directly costs O(S³) (see the sketch after this section)
- Policy iteration:
- alternates evaluating a fixed policy (policy evaluation) with improving it (policy improvement)
- Step 1: policy evaluation: calculate utilities of some fixed (not necessarily optimal) policy until they converge
- V_{k+1}^{π_i}(s) = ∑_{s′} T(s,π_i(s),s′) [R(s,π_i(s),s′) + γ V_k^{π_i}(s′)]
- note that π_i is held fixed during this step; only the values V^{π_i} are updated
- Step 2: Policy improvement: update the policy using one-step lookahead, with the resulting converged (but not optimal) utilities as future values
- π_{i+1}(s) = argmax_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V^{π_i}(s′)]
- Repeat until the policy converges (code sketches of both steps follow below)
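For Idea 2 above (no max, so the fixed-policy Bellman equations are linear), the values can be obtained with one direct linear solve. A sketch assuming numpy and the same (next_state, probability, reward) transition-table layout as before; the function name is made up.

```python
import numpy as np

# Exact policy evaluation (Idea 2): with the policy fixed there is no max,
# so V^pi satisfies the linear system (I - gamma * P_pi) V = R_pi, where
# P_pi[i, j] = T(s_i, pi(s_i), s_j) and R_pi[i] is the expected one-step reward.

def evaluate_policy_exact(states, T, policy, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    R = np.zeros(n)
    for s in states:
        if s not in policy:                     # terminal state: V^pi(s) = 0
            continue
        for (s2, p, r) in T[(s, policy[s])]:
            P[idx[s], idx[s2]] += p
            R[idx[s]] += p * r
    V = np.linalg.solve(np.eye(n) - gamma * P, R)   # the O(S^3) direct solve
    return {s: float(V[idx[s]]) for s in states}
```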
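And a sketch of the full policy-iteration loop, alternating iterative policy evaluation (Idea 1) with the one-step-lookahead improvement step, under the same assumed transition-table layout; the initial policy and the tolerance are arbitrary illustrative choices.

```python
# Policy iteration: evaluate the current fixed policy until its values
# converge, then improve it with a one-step lookahead, and repeat until
# the policy stops changing.

def policy_iteration(states, actions, T, gamma=0.9, tol=1e-6):
    # Arbitrary initial policy: first available action in each non-terminal state.
    policy = {}
    for s in states:
        avail = [a for a in actions if (s, a) in T]
        if avail:
            policy[s] = avail[0]
    while True:
        # Step 1: policy evaluation -- iterate the fixed-policy Bellman update.
        V = {s: 0.0 for s in states}
        while True:
            V_new = {}
            for s in states:
                if s in policy:
                    V_new[s] = sum(p * (r + gamma * V[s2])
                                   for (s2, p, r) in T[(s, policy[s])])
                else:
                    V_new[s] = 0.0              # terminal state
            if max(abs(V_new[s] - V[s]) for s in states) < tol:
                V = V_new
                break
            V = V_new
        # Step 2: policy improvement via one-step lookahead on V^{pi_i}.
        new_policy = {}
        for s in policy:
            avail = [a for a in actions if (s, a) in T]
            new_policy[s] = max(avail, key=lambda a: sum(
                p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)]))
        if new_policy == policy:                # policy converged
            return policy, V
        policy = new_policy
```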
Linear Programming: