Need to repeat the same (s,a,s′,r) transitions in the environment many times to propagate values
So collect transitions in a memory buffer and “replay” them to update Q-values
Use the memory of transitions only, no need to repeat them in the environment
There is evidence of such experience replay in the brain
At each step (see the code sketch below):
receive a sample transition (s,a,s′,r)
add (s,a,s′,r) to the replay buffer
Repeat n times:
randomly pick a transition (s,a,s′,r) from the replay buffer
Make a sample based on (s,a,s′,r): sample = R(s,a,s′) + γ max_a′ Q(s′,a′)
Update Q with the picked sample: Q(s,a) = (1−α) Q(s,a) + α · sample
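A minimal Python sketch of this replay loop, assuming a tabular Q stored as a dict and a fixed, known action set; the function name, n_replay, and the default α/γ values are illustrative choices.

```python
import random

def replay_q_learning_step(Q, buffer, transition, actions,
                           alpha=0.1, gamma=0.9, n_replay=10):
    """One environment step of Q-learning with experience replay.
    Q: dict mapping (state, action) -> value, missing entries treated as 0.
    buffer: list of previously seen transitions (s, a, s_next, r)."""
    # Receive a sample transition (s, a, s', r) and add it to the replay buffer.
    buffer.append(transition)
    # Repeat n times: pick a random stored transition and apply the Q-learning update.
    for _ in range(n_replay):
        s, a, s_next, r = random.choice(buffer)
        # sample = R(s,a,s') + gamma * max_a' Q(s',a')  (r is the stored reward)
        sample = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
        # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * sample
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```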
This way, Q-learning still converges to the optimal policy even while acting suboptimally: off-policy learning
But:
need to explore enough (visit state–action pairs often enough)
eventually the learning rate must be made small enough, but not decreased too quickly
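These are the standard stochastic-approximation conditions on the learning rate: Σ_t α_t = ∞ and Σ_t α_t² < ∞; for example, a schedule like α_t = 1/t shrinks to zero, but not too quickly.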
Approximate Q-learning:
We can exploit the structure of the problem to reduce the number of states
there are too many states and too many Q-values to store and visit them all
Generalization: learn about only a subset of states, then use that knowledge to act in new, similar situations
Ex: Pacman has 4 symmetric corners, so we can generalize from the situation in one corner to the other corners
Describe a state as a vector of features:
ex: distance to the closest ghost, number of ghosts, is Pacman next to a wall?
This way, we can write the Q-value as a linear value function (see the feature sketch below):
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + ... + wn fn(s,a)
Then all experience is summed up in a few weights (one vector), but if the features are badly designed, states with similar features can still have very different values
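A small Python sketch of such a feature-based Q-function; the specific feature names and the state helpers (distance_to_closest_ghost, num_ghosts, next_to_wall) are assumed for illustration.

```python
def features(state, action):
    """Map a (state, action) pair to a small feature vector.
    The concrete features (ghost distance, number of ghosts, wall flag) are
    the examples mentioned above, computed by hypothetical state helpers."""
    return {
        "bias": 1.0,
        "dist-closest-ghost": state.distance_to_closest_ghost(action),  # assumed helper
        "num-ghosts": float(state.num_ghosts),                          # assumed attribute
        "next-to-wall": 1.0 if state.next_to_wall(action) else 0.0,     # assumed helper
    }

def q_value(weights, state, action):
    """Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + ... + wn*fn(s,a)."""
    f = features(state, action)
    return sum(weights.get(name, 0.0) * value for name, value in f.items())
```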
Q-learning with linear Q-functions (update sketched in code below):
difference = [r + γ max_a′ Q(s′,a′)] − Q(s,a)
Q(s,a) ← Q(s,a) + α · difference   (exact Q-values: the tabular update)
wi ← wi + α · difference · fi(s,a)   (approximate Q-values: update each weight)
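A sketch of this weight update in Python, reusing features and q_value from the previous sketch; alpha, gamma, and the transition layout (s, a, s′, r) match the tabular version above.

```python
def update_weights(weights, transition, actions, alpha=0.1, gamma=0.9):
    """One approximate Q-learning update on the weight vector (stored as a dict)."""
    s, a, s_next, r = transition
    # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
    best_next = max(q_value(weights, s_next, a2) for a2 in actions) if actions else 0.0
    difference = (r + gamma * best_next) - q_value(weights, s, a)
    # wi <- wi + alpha * difference * fi(s,a): each weight moves in proportion to its feature
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```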