Description:

  • A model-free way of doing Policy Evaluation, mimicking Bellman updates with running sample averages
  • Idea: learn from every experience
    • Update each time we experience a transition
    • Likely outcomes s' will contribute updates more often than unlikely ones
  • TD learning answers the question of how to compute this weighted average without knowing the weights (the transition probabilities): it uses an exponential moving average of samples, so frequent outcomes pull the estimate harder (see the first sketch after this list)
  • Problem:
    • If we want to turn values into a (new) policy, we are stuck: acting greedily needs π(s) = argmax_a Σ_s' T(s, a, s')[R(s, a, s') + γ V^π(s')], and we don't have T or R, so no way to do this one-step lookahead
  • Solution:
    • Learn Q-values instead: Q-learning
    • Makes action selection model-free as well (a short sketch follows below)
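
A minimal sketch of the exponential-moving-average trick (plain Python; the numbers, names, and alpha = 0.1 are made up for illustration, not part of the notes):

```python
import random

# Exponential moving average: each sample nudges the running estimate toward itself.
# Likely outcomes appear more often, so they pull harder, and the estimate settles
# near the probability-weighted average even though the probabilities are never used.
def ema_update(estimate, sample, alpha=0.1):
    return (1 - alpha) * estimate + alpha * sample

# Toy example (made up): a sample is 10 with probability 0.8, else 0.
estimate = 0.0
for _ in range(10_000):
    sample = 10.0 if random.random() < 0.8 else 0.0
    estimate = ema_update(estimate, sample)

print(estimate)  # hovers around the weighted average 0.8*10 + 0.2*0 = 8
```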

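A hedged sketch of where the solution leads: the tabular Q-learning update and model-free greedy action selection (again illustrative Python, with made-up constants and a simple dict-backed table; exploration is omitted):

```python
from collections import defaultdict

gamma, alpha = 0.9, 0.1          # illustrative constants
Q = defaultdict(float)           # Q[(s, a)]; unseen pairs default to 0

def q_update(s, a, r, s_next, actions):
    """Tabular Q-learning: the same running-average trick, applied to Q-values."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # no T or R needed
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

def greedy_action(s, actions):
    """Action selection is model-free too: just argmax over learned Q-values."""
    return max(actions, key=lambda a: Q[(s, a)])
```
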
Temporal difference of value:

  • Policy still fixed, still doing evaluation
  • Move values toward value of whatever successor occurs: running average
  • Sample of V(s): sample = R(s, π(s), s') + γ · V^π(s')
  • Update to V(s): V^π(s) ← (1 − α) · V^π(s) + α · sample
  • Same update, rewritten: V^π(s) ← V^π(s) + α · (sample − V^π(s))
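
A short sketch of these updates as TD(0) policy evaluation on a made-up two-state chain (Python; alpha, gamma, and the toy MDP are illustrative assumptions, not part of the notes):

```python
import random
from collections import defaultdict

gamma, alpha = 0.9, 0.1          # illustrative constants
V = defaultdict(float)           # running estimate of V^pi(s); unseen states default to 0

def td0_update(s, r, s_next):
    """One TD(0) step on an observed transition (s, pi(s), r, s')."""
    sample = r + gamma * V[s_next]              # sample of V(s)
    V[s] = (1 - alpha) * V[s] + alpha * sample  # exponential moving average
    # equivalent form: V[s] += alpha * (sample - V[s])

# Toy chain (made up): from 'A' the fixed policy reaches terminal 'B' (reward 1)
# 80% of the time, and loops back to 'A' (reward 0) otherwise.
for _ in range(5_000):
    if random.random() < 0.8:
        td0_update('A', 1.0, 'B')
    else:
        td0_update('A', 0.0, 'A')

print(V['A'])  # hovers near the true value 0.8 / (1 - 0.2*0.9) ~= 0.98
```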