Description:

  • An MDP is defined with:
    • A set of states
    • A set of actions
    • A Transition Function
      • T(s, a, s') = probability that taking action a in state s leads to s', i.e., P(s' | s, a)
      • Also called the model or the dynamics
    • A reward function R(s, a, s')
      • Sometimes just R(s) or R(s')
    • A start state
    • Maybe a terminal state
  • Usually denoted as the tuple 〈S, A, T, R, 𝛄〉 (see the sketch after this list)
  • These solution methods only work if we know the full reward function and transition model
  • Better than alpha-beta / expectimax search, since working over states eliminates the looping (repeated-state) blow-up of the search tree
  • So you want to:
    • Compute optimal values: use value iteration or policy iteration
    • Compute values for a particular policy: use policy evaluation
    • Turn your values into a policy: use policy extraction (one-step lookahead)
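
A minimal sketch (Python) of how such an MDP tuple 〈S, A, T, R, 𝛄〉 could be represented; the two-state example, the state/action names, and the reward values are purely illustrative, not from the notes:

    # Hypothetical two-state MDP, for illustration only
    states  = ["cool", "warm"]            # S
    actions = ["slow", "fast"]            # A
    gamma   = 0.9                         # discount 𝛄

    # T[s][a] is a list of (next_state, probability) pairs, i.e. T(s, a, s')
    T = {
        "cool": {"slow": [("cool", 1.0)],
                 "fast": [("cool", 0.5), ("warm", 0.5)]},
        "warm": {"slow": [("cool", 0.5), ("warm", 0.5)],
                 "fast": [("warm", 1.0)]},
    }

    # R(s, a, s'): reward collected on the transition s --a--> s'
    def R(s, a, s_next):
        return 2.0 if a == "fast" else 1.0

Later sketches in these notes reuse these names.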

Policies:

  • For MDPs, we want an optimal policy 𝛑*: S → A
    • A policy 𝛑 gives an action for each state; following a fixed policy makes the MDP behave like a Markov chain (see the sketch after this list)
    • An optimal policy is one that maximizes expected utility if followed
    • An explicit policy defines a reflex agent
    • Expectimax doesn’t compute entire policies, it computes the action for a single state only
        • as states can be infinite and non-recurrent, it can't be represented with a Markov chain
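
To illustrate the "policy induces a Markov chain" point above, a policy can be stored as a plain state-to-action map; fixing it collapses T(s, a, s') into chain probabilities P(s' | s). This reuses the illustrative MDP sketched in the Description section:

    # A (not necessarily optimal) policy: one action per state
    policy = {"cool": "fast", "warm": "slow"}

    # Under a fixed policy the MDP reduces to a Markov chain:
    # P(s' | s) = T(s, policy[s], s')
    chain = {s: dict(T[s][policy[s]]) for s in states}
    print(chain)   # {'cool': {'cool': 0.5, 'warm': 0.5}, 'warm': {'cool': 0.5, 'warm': 0.5}}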

Optimal quantities:

  • V*(s) = utility (value) of state s: the expected utility starting in s and acting optimally
    • the max over actions of the expected utility of taking that action
  • Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
    • Q*(s, a) = Σ_s' T(s, a, s') · [ R(s, a, s') + 𝛄 V*(s') ]
      • Note that the transition function T(s, a, s') is a probability
  • Then V*(s) = max_a Q*(s, a)
  • 𝛑*(s): the optimal action to take from state s
  • Solution to these recursive (Bellman) equations (see the sketch after this list):
    • Time-limited values:
      • Define V_k(s) to be the optimal value of s if the game ends in k more steps
        • compute the values at the leaves (k steps away), then use them to find the values of their parents, and so on up the tree
      • equivalent to limiting the expectimax depth to k
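
As the sketch referenced above, here is one Bellman backup written out over the illustrative MDP from the Description section: Q from a current value estimate, then V_{k+1} as the max over actions (helper names like q_value are my own):

    def q_value(s, a, V):
        # Q(s, a) = Σ_s' T(s, a, s') * (R(s, a, s') + 𝛄 * V(s'))
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[s][a])

    def bellman_backup(s, V):
        # V_{k+1}(s) = max_a Q(s, a), using V as the k-step (time-limited) values
        return max(q_value(s, a, V) for a in actions)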

Value iteration:

  • Init:
    • Start with V_0(s) = 0: with no time steps left, there is no utility to collect for any state
    • except terminal states
  • Iterate:
    • Use V_k to compute the next estimate for every state: V_{k+1}(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + 𝛄 V_k(s') ]
  • Repeat until convergence, i.e., until V_{k+1} is (approximately) equal to V_k
    • Complexity of each iteration: O(S²A)
    • We update S states; for each state we consider all A actions, and for each action we sum over all S possible next states
  • Theorem: the values will converge to the unique optimal values V*
  • Policy extraction:
    • After we have the optimal values V*, we need to know which action to take in each state
      • do a small (one-step) expectimax over those values
      • pick the action that maximizes expected utility: 𝛑*(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + 𝛄 V*(s') ] (see the sketch after this list)
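
A compact value-iteration plus policy-extraction sketch, reusing the q_value/bellman_backup helpers above; the stopping tolerance 1e-6 is an arbitrary illustrative choice:

    def value_iteration(tol=1e-6):
        V = {s: 0.0 for s in states}                     # V_0 = 0 everywhere
        while True:
            V_new = {s: bellman_backup(s, V) for s in states}
            if max(abs(V_new[s] - V[s]) for s in states) < tol:
                return V_new                             # approximately V*
            V = V_new

    def extract_policy(V):
        # One-step lookahead ("small expectimax") over the converged values
        return {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}

    V_star  = value_iteration()
    pi_star = extract_policy(V_star)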

Policy-centric methods:

  • Policy Evaluation
  • Fixed policies:
    • We don't change the policy; we only compute the values it achieves
    • The tree would be simpler – only one action to take (according to the fixed policy) per state
      • … though the tree’s value would depend on which policy we fixed
    • Define the utility of a state s under a fixed policy 𝝿: V^𝝿(s) = expected total discounted rewards starting in s and following 𝝿
    • O(S²) per iteration, since there is no max over actions
    • Idea 1: turn the recursive Bellman equations into iterative updates (as in value iteration, but using the policy's action instead of a max)
      • evaluating the recursion naively would have exponential cost; the bottom-up updates avoid this
    • Idea 2: without the maxes, the Bellman equations become a linear system
      • the policy's transition matrix times the vector of V^𝝿 values: solve v = r^𝝿 + 𝛄 T^𝝿 v for v
      • much simpler: hand the linear system to a standard solver
  • Policy iteration:
    • combines fixed-policy evaluation with policy improvement
    • Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal) until convergence
      • each time the value function is recomputed here, the policy gets updated in the next step
    • Step 2: policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal) utilities as future values
    • Repeat until the policy converges (see the sketch after this list)
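
A sketch of both policy-evaluation ideas (the iterative update and the linear-system solve, here via numpy) and of the two-step policy-iteration loop, again over the illustrative MDP and q_value helper from above:

    import numpy as np

    def policy_evaluation_iterative(policy, tol=1e-6):
        # Idea 1: fixed-point updates, no max over actions
        V = {s: 0.0 for s in states}
        while True:
            V_new = {s: q_value(s, policy[s], V) for s in states}
            if max(abs(V_new[s] - V[s]) for s in states) < tol:
                return V_new
            V = V_new

    def policy_evaluation_linear(policy):
        # Idea 2: solve (I - 𝛄 T^𝝿) v = r^𝝿 directly
        idx = {s: i for i, s in enumerate(states)}
        n = len(states)
        T_pi = np.zeros((n, n))                  # transition matrix under 𝝿
        r_pi = np.zeros(n)                       # expected one-step reward under 𝝿
        for s in states:
            for s2, p in T[s][policy[s]]:
                T_pi[idx[s], idx[s2]] += p
                r_pi[idx[s]] += p * R(s, policy[s], s2)
        v = np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)
        return {s: v[idx[s]] for s in states}

    def policy_iteration():
        policy = {s: actions[0] for s in states}         # arbitrary initial policy
        while True:
            V = policy_evaluation_linear(policy)         # Step 1: evaluate fixed policy
            new_policy = {s: max(actions, key=lambda a: q_value(s, a, V))
                          for s in states}               # Step 2: one-step improvement
            if new_policy == policy:                     # policy has converged
                return policy, V
            policy = new_policy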

Linear Programming: