Description:

  • Goal: Compute the value of each state under a given policy 𝛑, using only samples
  • Idea: Average together observed sample values
    • Act according to 𝛑
    • Every time you visit a state, write down the sum of discounted rewards that actually followed from that state to the end of the episode:
      • regardless of what happened in between
    • Average those samples for each state:
      • the average is your estimate of that state's value (see the sketch after this list)
  • Good:
    • easy
    • doesn't require knowing T (the transition model) or R (the reward function)
    • eventually computes the correct average values, using just sample transitions
  • Bad:
    • wastes information about how states are connected, so the learned state values need not be consistent with each other
    • each state's value must be learned separately, which takes a long time
  • We can't use exact Policy Evaluation here because it needs T and R, which we don't have
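
A minimal sketch of this direct-evaluation idea in Python (the chain MDP, its noisy dynamics, and the fixed "move right" policy below are illustrative assumptions, not from the notes): episodes are generated by acting under the policy, the discounted return from every state visit is recorded, and the per-state averages give the value estimates, without ever reading T or R.

```python
import random
from collections import defaultdict

GAMMA = 0.9      # discount factor (assumed)
TERMINAL = 3     # states 0, 1, 2 are non-terminal; state 3 is terminal

def step(state):
    """One sampled transition under the fixed policy pi ("move right").

    Assumed noisy dynamics: the move succeeds 80% of the time, otherwise
    the agent stays put. The learner never sees these probabilities; it
    only observes sampled (next_state, reward) pairs.
    """
    next_state = state + 1 if random.random() < 0.8 else state
    reward = 10.0 if next_state == TERMINAL else -1.0
    return next_state, reward

def run_episode(start=0):
    """Act according to pi until the end, recording (state, reward) pairs."""
    trajectory, state = [], start
    while state != TERMINAL:
        next_state, reward = step(state)
        trajectory.append((state, reward))
        state = next_state
    return trajectory

def direct_evaluation(num_episodes=5000):
    """Average the observed discounted returns over every state visit."""
    total_return = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(num_episodes):
        # Walk the episode backwards so G accumulates the sum of
        # discounted rewards from each visited state to the end.
        G = 0.0
        for state, reward in reversed(run_episode()):
            G = reward + GAMMA * G
            total_return[state] += G
            visits[state] += 1
    return {s: total_return[s] / visits[s] for s in total_return}

if __name__ == "__main__":
    for s, v in sorted(direct_evaluation().items()):
        print(f"V({s}) = {v:.2f}")
```

Note how the estimate for each state is computed in isolation: nothing ties V(0) to V(1) through the one-step (Bellman) relationship, which is exactly the "wasted information" drawback listed above.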