Description:
- Goal: Compute the value of each state under a given policy 𝛑, using samples
- Idea: Average together observed sample values
- Act according to 𝛑
- Every time you visit a state, write down the sum of discounted rewards observed from that state to the end of the episode: sampleᵢ(s) = Σₜ γᵗ Rₜ
- regardless of what happened along the way
- Average those samples: V^𝛑(s) ≈ (1/N) Σᵢ sampleᵢ(s)
- to obtain an estimated value for every state (see the code sketch after this list)
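A minimal sketch of these steps, assuming episodes are recorded as lists of (state, reward) pairs obtained by acting according to 𝛑 (the function name and episode format are illustrative, not from the notes):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Every-visit direct evaluation: average observed discounted returns per state.

    episodes: list of episodes generated by acting according to pi,
              each a list of (state, reward) pairs in time order.
    """
    totals = defaultdict(float)   # sum of sampled returns for each state
    counts = defaultdict(int)     # number of samples recorded for each state

    for episode in episodes:
        # Walk the episode backwards so each visit's return is computed in O(1):
        # G = R_t + gamma * (return from the next step onward)
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            totals[state] += G    # one sample_i(s) for this visit of s
            counts[state] += 1

    # V(s) ≈ (1/N) * sum_i sample_i(s)
    return {s: totals[s] / counts[s] for s in totals}

# Tiny usage example with two episodes and gamma = 1:
eps = [[("A", 0), ("B", 0), ("C", 10)],
       [("A", 0), ("B", -10)]]
print(direct_evaluation(eps, gamma=1.0))
# From A we observed returns 10 and -10, so V(A) ≈ 0
```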
- Good:
- Easy to understand and implement
- Doesn't require knowledge of T and R
- Eventually computes the correct average values using just sample transitions
- Bad:
- Wastes information about how states are connected, since each state's value is estimated independently and the estimates need not be consistent with each other
- Each state's value must be learned separately → takes a long time to learn
- We can't apply (model-based) policy evaluation here because we don't have T and R
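For contrast, a standard form of the model-based policy-evaluation update shows why T and R are needed; this equation is included here for reference and is not from the original notes:

```latex
V^{\pi}_{k+1}(s) \;=\; \sum_{s'} T\big(s,\pi(s),s'\big)\,\Big[R\big(s,\pi(s),s'\big) + \gamma\, V^{\pi}_{k}(s')\Big]
```

Direct evaluation sidesteps this update entirely by averaging sampled returns instead of backing values up through the (unknown) model.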