Description:
- An MDP is defined with:
- A set of states s∈S
- A set of actions a∈A
- A Transition Function T(s,a,s’)
- Probability that a from s leads to s’, i.e., P(s’∣s,a)
- Also called the model or the dynamics
- A reward function R(s,a,s’)
- Sometimes just R(s) or R(s’)
- A start state
- Maybe a terminal state
- Usually denoted as the tuple 〈S, A, T, R, 𝛄〉 (a toy Python encoding of this tuple is sketched after this list)
- These methods only work if we know the full reward function and transition model
- Better suited than alpha-beta/expectimax search here, since solving the MDP directly avoids repeatedly re-expanding the same states
- So you want to:
- Compute optimal values: use value iteration or policy iteration
- Compute values for a particular policy: use policy evaluation
- Turn your values into a policy: use policy extraction (one-step lookahead)
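To make the tuple concrete, here is one possible encoding of a tiny MDP in Python. This is a minimal sketch: the state names, actions, probabilities, and rewards are invented purely for illustration.

```python
# A toy encoding of an MDP tuple <S, A, T, R, gamma>; every name and number
# here is made up for illustration only.

GAMMA = 0.9                                       # discount factor

STATES = ["cool", "warm", "overheated"]           # S
ACTIONS = ["slow", "fast"]                        # A

# T and R folded together: T[(s, a)] is a list of (s_next, prob, reward)
# triples, i.e. P(s' | s, a) paired with R(s, a, s').
T = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
    ("warm", "fast"): [("overheated", 1.0, -10.0)],
    # "overheated" is terminal: no outgoing actions.
}

START_STATE = "cool"
```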
Policies:
- For MDPs, we want an optimal policy 𝛑*: S → A
- A policy π gives an action for each state; fixing a policy turns the MDP into a Markov chain
- An optimal policy is one that maximizes expected utility if followed
- An explicit policy defines a reflex agent
- Expectimax doesn’t compute entire policies, it computes the action for a single state only
- since the expectimax tree can be infinite and its states need not recur, it cannot be represented as a Markov chain
Optimal quantities:
- V*(s) = max_a Q*(s,a)
- utility (value) of state s, meaning starting in s and acting optimally
- i.e., the max over actions a of the expected utility Q*(s,a)
- Q*(s,a) = ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)]
- expected utility starting out having taken action a from state s and (thereafter) acting optimally
- Q* = transition probability × (reward for reaching s′ from s + γ × value of s′), summed over all s′
- Note that the transition function is a probability
- Then V*(s) = max_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)]
- π∗(s): optimal action from state s
- Solving these recursive equations:
- Time-limited value:
- Define Vk(s) to be the optimal value of s if the game ends in k more steps
- compute the values of the leaf nodes at depth k, then use them to find the values of their parents, and so on up the tree
- equivalent to limiting the expectimax depth to k (a minimal code sketch of one such backup follows this section)
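The V*/Q* recursion translates directly into a one-step backup. Below is a minimal sketch, assuming transitions are stored as a dict mapping (state, action) to (next_state, probability, reward) triples; the helper names q_value and backup are made up for illustration.

```python
# A single Bellman backup, assuming `transitions[(s, a)]` is a list of
# (s_next, prob, reward) triples and `V` maps states to current value
# estimates (e.g., the time-limited values V_k).

def q_value(transitions, V, s, a, gamma=0.9):
    """Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for (s2, p, r) in transitions[(s, a)])

def backup(transitions, V, s, actions, gamma=0.9):
    """V_{k+1}(s) = max_a Q(s, a); terminal states (no actions) get 0."""
    qs = [q_value(transitions, V, s, a, gamma)
          for a in actions if (s, a) in transitions]
    return max(qs) if qs else 0.0
```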
Value iteration:
- Init: ∀s: V(s)=0
- Start with V_0(s) = 0: with no time steps left, no utility can be earned from any state
- except terminal states
- Iterate:
- ∀s: V_new(s) = max_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V(s′)]
- V = V_new
- Use V*(s) = max_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)] to compute each state's value from the values of its successor states
- Repeat until convergence; the values converge to V* (a code sketch of the full loop, plus policy extraction, follows this section)
- Complexity of each iteration: O(S²A)
- Each iteration computes V_{k+1} for all S states; for each state we consider every action, and for each action we sum over up to S successor states
- Theorem: the values will converge to the unique optimal values
- Policy extraction:
- After we have the optimal value V∗(s), we need to know which action to take
- π*(s) = argmax_a Q*(s,a) = argmax_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V*(s′)]
- i.e., do a one-step (mini) expectimax
- pick the action that maximizes expected utility
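A minimal sketch of the full value-iteration loop plus policy extraction, assuming the same (next_state, probability, reward) transition-table layout as the earlier sketches; the function names and the convergence tolerance are illustrative choices.

```python
# Value iteration (V_0 = 0, repeat Bellman backups until the values stop
# changing) followed by one-step-lookahead policy extraction.
# Assumes T[(s, a)] is a list of (s_next, prob, reward) triples.

def value_iteration(states, actions, T, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                     # V_0(s) = 0 for all s
    while True:
        V_new = {}
        for s in states:
            qs = [sum(p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)])
                  for a in actions if (s, a) in T]
            V_new[s] = max(qs) if qs else 0.0        # terminal states stay at 0
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                             # (approximately) V*
        V = V_new

def extract_policy(states, actions, T, V, gamma=0.9):
    """pi(s) = argmax_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V(s')]."""
    policy = {}
    for s in states:
        avail = [a for a in actions if (s, a) in T]
        if not avail:                                # terminal state: no action
            continue
        policy[s] = max(avail, key=lambda a: sum(
            p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)]))
    return policy
```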
Policy-centric methods:
- Policy Evaluation
- Fixed policies:
- The policy itself is held fixed; we only compute its values
- The tree would be simpler – only one action to take (according to the fixed policy) per state
- … though the tree’s value would depend on which policy we fixed
- Define the utility of a state s under a fixed policy π: V^π(s) = expected total discounted reward starting in s and following π
- V^π(s) = ∑_{s′} T(s,π(s),s′) [R(s,π(s),s′) + γ V^π(s′)]
- O(S²) per iteration
- Idea 1: turn the recursive Bellman equations into iterative updates
- V^π_0(s) = 0
- V^π_{k+1}(s) = ∑_{s′} T(s,π(s),s′) [R(s,π(s),s′) + γ V^π_k(s′)]
- (done as naive recursion without dynamic programming, the cost would be exponential)
- Idea 2: without the maxes, the Bellman equations become a linear system
- i.e., a matrix equation in the vector of values (V^π(s₁), …, V^π(s_n))
- much simpler: solving the linear system directly costs O(S³) (see the sketch after this section)
- Policy iteration:
- alternates evaluating a fixed policy (policy evaluation) with improving it (policy improvement)
- Step 1: policy evaluation: calculate utilities of some fixed (not necessarily optimal) policy until they converge
- V_{k+1}^{π_i}(s) = ∑_{s′} T(s,π_i(s),s′) [R(s,π_i(s),s′) + γ V_k^{π_i}(s′)]
- note that π_i is held fixed during this step; only the values V^{π_i} are updated
- Step 2: Policy improvement: update the policy using one-step lookahead, with the resulting converged (but not optimal) utilities as future values
- π_{i+1}(s) = argmax_a ∑_{s′} T(s,a,s′) [R(s,a,s′) + γ V^{π_i}(s′)]
- Repeat until the policy converges (code sketches of both steps follow below)
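For Idea 2 above (no max, so the fixed-policy Bellman equations are linear), the values can be obtained with one direct linear solve. A sketch assuming numpy and the same (next_state, probability, reward) transition-table layout as before; the function name is made up.

```python
import numpy as np

# Exact policy evaluation (Idea 2): with the policy fixed there is no max,
# so V^pi satisfies the linear system (I - gamma * P_pi) V = R_pi, where
# P_pi[i, j] = T(s_i, pi(s_i), s_j) and R_pi[i] is the expected one-step reward.

def evaluate_policy_exact(states, T, policy, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    R = np.zeros(n)
    for s in states:
        if s not in policy:                     # terminal state: V^pi(s) = 0
            continue
        for (s2, p, r) in T[(s, policy[s])]:
            P[idx[s], idx[s2]] += p
            R[idx[s]] += p * r
    V = np.linalg.solve(np.eye(n) - gamma * P, R)   # the O(S^3) direct solve
    return {s: float(V[idx[s]]) for s in states}
```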
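And a sketch of the full policy-iteration loop, alternating iterative policy evaluation (Idea 1) with the one-step-lookahead improvement step, under the same assumed transition-table layout; the initial policy and the tolerance are arbitrary illustrative choices.

```python
# Policy iteration: evaluate the current fixed policy until its values
# converge, then improve it with a one-step lookahead, and repeat until
# the policy stops changing.

def policy_iteration(states, actions, T, gamma=0.9, tol=1e-6):
    # Arbitrary initial policy: first available action in each non-terminal state.
    policy = {}
    for s in states:
        avail = [a for a in actions if (s, a) in T]
        if avail:
            policy[s] = avail[0]
    while True:
        # Step 1: policy evaluation -- iterate the fixed-policy Bellman update.
        V = {s: 0.0 for s in states}
        while True:
            V_new = {}
            for s in states:
                if s in policy:
                    V_new[s] = sum(p * (r + gamma * V[s2])
                                   for (s2, p, r) in T[(s, policy[s])])
                else:
                    V_new[s] = 0.0              # terminal state
            if max(abs(V_new[s] - V[s]) for s in states) < tol:
                V = V_new
                break
            V = V_new
        # Step 2: policy improvement via one-step lookahead on V^{pi_i}.
        new_policy = {}
        for s in policy:
            avail = [a for a in actions if (s, a) in T]
            new_policy[s] = max(avail, key=lambda a: sum(
                p * (r + gamma * V[s2]) for (s2, p, r) in T[(s, a)]))
        if new_policy == policy:                # policy converged
            return policy, V
        policy = new_policy
```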
Linear Programming: