Description:

  • Same as TD learning, but we deal with Q-values instead of state values
  • Sample-based Q-value iteration
  • Update as we go
  • Idea: learn Q-values as we explore (see the sketch after this list):
    • receive a sample (s, a, r, s')
    • consider your old estimate: Q(s, a)
    • consider your new sample estimate: sample = r + γ · max_a' Q(s', a')
    • Incorporate the new estimate into a running average: Q(s, a) ← (1 − α) · Q(s, a) + α · sample
  • Black magic: Q-learning will converge to the optimal policy, even if acting suboptimally
    • this is off-policy learning
  • Warnings:
    • have to explore enough
    • eventually make the learning rate small enough
      • but not decreasing it too quickly
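
A minimal sketch of the tabular Q-learning update and epsilon-greedy exploration described above; the state/action encoding, the `actions` list, and the hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

alpha = 0.1    # learning rate (should eventually be made small, but not too quickly)
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate ("have to explore enough")

Q = defaultdict(float)  # Q[(state, action)] -> running-average estimate

def update(state, action, reward, next_state, actions):
    """Incorporate one sample transition (s, a, r, s') into the running average."""
    # new sample estimate: r + gamma * max_a' Q(s', a')
    sample = reward + gamma * max(Q[(next_state, a)] for a in actions)
    # blend the old estimate with the new sample
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * sample

def choose_action(state, actions):
    """Epsilon-greedy behavior policy; Q-learning still learns optimal values
    even though this policy acts suboptimally (off-policy learning)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```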

Q-Value Iteration:

  • Find successive (depth-limited) Q-values (see the sketch after this list):
    • Start with Q_0(s, a) = 0, which we know is right
    • Given Q_k, calculate the depth k+1 Q-values for all Q-states:
      • Q_{k+1}(s, a) ← Σ_s' T(s, a, s') · [ R(s, a, s') + γ · max_a' Q_k(s', a') ]
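
A minimal sketch of Q-value iteration with a known model; the layout of the transition model `T` (a dict of lists of (prob, next_state, reward) triples) is a hypothetical convention, not something fixed by the notes.

```python
def q_value_iteration(states, actions, T, gamma=0.9, iterations=100):
    """T[s][a] is assumed to be a list of (prob, next_state, reward) triples."""
    # Q_0(s, a) = 0 for all (s, a), which we know is right at depth 0
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iterations):
        # compute the depth k+1 Q-values from the depth k Q-values
        new_Q = {}
        for s in states:
            for a in actions:
                new_Q[(s, a)] = sum(
                    p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for p, s2, r in T[s][a]
                )
        Q = new_Q
    return Q
```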

Q-Learning with replay buffer:

  • Need to repeat the same transitions in the environment many times to propagate values
  • So collect transitions in a memory buffer and “replay” them to update Q-values
    • Use memory of transitions only, no need to repeat them in environment
  • Evidence of such experience replay in the brain
  • At each step:
    • receive a sample transition
    • add to replay buffer
  • Repeat n times:
    • randomly pick transition from replay buffer
    • Make a sample based on the picked transition (s, a, r, s'): sample = r + γ · max_a' Q(s', a')
    • Update Q based on the picked sample: Q(s, a) ← (1 − α) · Q(s, a) + α · sample
  • This way, Q-learning still converges to the optimal policy even while acting suboptimally (off-policy learning); see the sketch after this list
  • But:
    • need to explore enough
    • eventually make the learning rate small enough, but not decrease it too quickly
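
A minimal sketch of Q-learning with an experience replay buffer, following the steps above; the buffer size, the batch count `n`, and the hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict, deque

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)
replay_buffer = deque(maxlen=10_000)  # memory of past transitions

def observe(state, action, reward, next_state):
    """At each step: receive a sample transition and add it to the replay buffer."""
    replay_buffer.append((state, action, reward, next_state))

def replay(actions, n=32):
    """Repeat n times: randomly pick a stored transition and update Q,
    without re-executing the transition in the environment."""
    for _ in range(n):
        s, a, r, s2 = random.choice(replay_buffer)
        sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```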

Approximate Q-learning:

  • We can exploit the structure of the problem to reduce the number of states
    • too many states and too many Q-values
  • Generalization: learn about only a subset of states, then use that knowledge to infer values in new, similar situations
  • Ex: Pacman has 4 symmetric corners, so we can generalize from the situation in one corner to the other three
  • Describe a state as a vector of features:
    • ex: distance to the closest ghost, number of ghosts, next to a wall?
  • This way, we can write the Q-value as a linear value function:
    • Q(s, a) = w_1 · f_1(s, a) + w_2 · f_2(s, a) + … + w_n · f_n(s, a)
  • Then all experience is summed up in a few weights; but if the features are not designed correctly, states with similar features can still have very different values
  • Q-learning with linear Q-functions (see the sketch after this list):
    • difference = [r + γ · max_a' Q(s', a')] − Q(s, a)
    • exact Q's: Q(s, a) ← Q(s, a) + α · difference
    • approximate Q's: w_i ← w_i + α · difference · f_i(s, a)
  • Use least-squares data fitting for finding the weights
  • slide 50
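
A minimal sketch of approximate Q-learning with a linear Q-function; the feature extractor `features(s, a)` (returning, e.g., distance to the closest ghost, number of ghosts, a next-to-a-wall indicator) is an assumed helper, not defined in the notes.

```python
from collections import defaultdict

alpha, gamma = 0.01, 0.9
weights = defaultdict(float)  # one weight per feature

def q_value(state, action):
    # Q(s, a) = w_1 * f_1(s, a) + ... + w_n * f_n(s, a)
    # features(state, action) is an assumed helper returning {feature_name: value}
    return sum(weights[k] * v for k, v in features(state, action).items())

def update(state, action, reward, next_state, actions):
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    target = reward + gamma * max(q_value(next_state, a2) for a2 in actions)
    difference = target - q_value(state, action)
    # approximate Q's: adjust each weight in proportion to its feature value
    for k, v in features(state, action).items():
        weights[k] += alpha * difference * v
```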