Description:

  • Goal: Compute the value of each state under a given policy 𝛑, using only samples
  • Idea: Average together observed sample values
    • Act according to 𝛑
    • Every time you visit a state, write down the sum of discounted rewards that actually followed from that state to the end of the episode:
      • regardless of what happened in between
    • Average those samples for each state:
      • the average is your estimate of that state's value (see the sketch after this list)
  • Good:
    • easy
    • doesn't require knowing T (the transition model) or R (the reward function)
    • eventually computes the correct average values, using just sample transitions
  • Bad:
    • wastes information about how states are connected, so the learned state values need not be consistent with each other
    • each state's value must be learned separately, which takes a long time
  • We can't use exact Policy Evaluation here because it needs T and R, which we don't have
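
A minimal sketch of this direct-evaluation idea in Python (the chain MDP, its noisy dynamics, and the fixed "move right" policy below are illustrative assumptions, not from the notes): episodes are generated by acting under the policy, the discounted return from every state visit is recorded, and the per-state averages give the value estimates, without ever reading T or R.

```python
import random
from collections import defaultdict

GAMMA = 0.9      # discount factor (assumed)
TERMINAL = 3     # states 0, 1, 2 are non-terminal; state 3 is terminal

def step(state):
    """One sampled transition under the fixed policy pi ("move right").

    Assumed noisy dynamics: the move succeeds 80% of the time, otherwise
    the agent stays put. The learner never sees these probabilities; it
    only observes sampled (next_state, reward) pairs.
    """
    next_state = state + 1 if random.random() < 0.8 else state
    reward = 10.0 if next_state == TERMINAL else -1.0
    return next_state, reward

def run_episode(start=0):
    """Act according to pi until the end, recording (state, reward) pairs."""
    trajectory, state = [], start
    while state != TERMINAL:
        next_state, reward = step(state)
        trajectory.append((state, reward))
        state = next_state
    return trajectory

def direct_evaluation(num_episodes=5000):
    """Average the observed discounted returns over every state visit."""
    total_return = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(num_episodes):
        # Walk the episode backwards so G accumulates the sum of
        # discounted rewards from each visited state to the end.
        G = 0.0
        for state, reward in reversed(run_episode()):
            G = reward + GAMMA * G
            total_return[state] += G
            visits[state] += 1
    return {s: total_return[s] / visits[s] for s in total_return}

if __name__ == "__main__":
    for s, v in sorted(direct_evaluation().items()):
        print(f"V({s}) = {v:.2f}")
```

Note how the estimate for each state is computed in isolation: nothing ties V(0) to V(1) through the one-step (Bellman) relationship, which is exactly the "wasted information" drawback listed above.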