Need to repeat the same (s,a,s′,r) transitions in the environment many times to propagate values
So collect transitions in a memory buffer and “replay” them to update Q-values
Use the memory of transitions only, no need to repeat them in the environment
There is evidence of such experience replay in the brain
At each step (see the code sketch below):
receive a sample transition (s,a,s′,r)
add (s,a,s′,r) to the replay buffer
Repeat n times:
randomly pick a transition (s,a,s′,r) from the replay buffer
Make a sample based on (s,a,s′,r): sample = R(s,a,s′) + γ max_a′ Q(s′,a′)
Update Q with the picked sample: Q(s,a) = (1−α) Q(s,a) + α · sample
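A minimal Python sketch of this replay loop, assuming a tabular Q stored as a dict and a fixed, known action set; the function name, n_replay, and the default α/γ values are illustrative choices.

```python
import random

def replay_q_learning_step(Q, buffer, transition, actions,
                           alpha=0.1, gamma=0.9, n_replay=10):
    """One environment step of Q-learning with experience replay.
    Q: dict mapping (state, action) -> value, missing entries treated as 0.
    buffer: list of previously seen transitions (s, a, s_next, r)."""
    # Receive a sample transition (s, a, s', r) and add it to the replay buffer.
    buffer.append(transition)
    # Repeat n times: pick a random stored transition and apply the Q-learning update.
    for _ in range(n_replay):
        s, a, s_next, r = random.choice(buffer)
        # sample = R(s,a,s') + gamma * max_a' Q(s',a')  (r is the stored reward)
        sample = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
        # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * sample
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```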
This way, Q-learning still converges to the optimal policy even while acting suboptimally: off-policy learning
But:
need to explore enough (visit state–action pairs often enough)
eventually the learning rate must be made small enough, but not decreased too quickly
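These are the standard stochastic-approximation conditions on the learning rate: Σ_t α_t = ∞ and Σ_t α_t² < ∞; for example, a schedule like α_t = 1/t shrinks to zero, but not too quickly.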
Approximate Q-learning:
We can exploit the structure of the problem to reduce the number of states
there are too many states and too many Q-values to store and visit them all
Generalization: learn about only a subset of states, then use that knowledge to act in new, similar situations
Ex: Pacman has 4 symmetric corners, so we can generalize from the situation in one corner to the other corners
Describe a state as a vector of features:
ex: distance to the closest ghost, number of ghosts, is Pacman next to a wall?
This way, we can write the Q-value as a linear value function (see the feature sketch below):
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + ... + wn fn(s,a)
Then all experience is summed up in a few weights (one vector), but if the features are badly designed, states with similar features can still have very different values
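A small Python sketch of such a feature-based Q-function; the specific feature names and the state helpers (distance_to_closest_ghost, num_ghosts, next_to_wall) are assumed for illustration.

```python
def features(state, action):
    """Map a (state, action) pair to a small feature vector.
    The concrete features (ghost distance, number of ghosts, wall flag) are
    the examples mentioned above, computed by hypothetical state helpers."""
    return {
        "bias": 1.0,
        "dist-closest-ghost": state.distance_to_closest_ghost(action),  # assumed helper
        "num-ghosts": float(state.num_ghosts),                          # assumed attribute
        "next-to-wall": 1.0 if state.next_to_wall(action) else 0.0,     # assumed helper
    }

def q_value(weights, state, action):
    """Q(s,a) = w1*f1(s,a) + w2*f2(s,a) + ... + wn*fn(s,a)."""
    f = features(state, action)
    return sum(weights.get(name, 0.0) * value for name, value in f.items())
```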
Q-learning with linear Q-functions (update sketched in code below):
difference = [r + γ max_a′ Q(s′,a′)] − Q(s,a)
Q(s,a) ← Q(s,a) + α · difference   (exact Q-values: the tabular update)
wi ← wi + α · difference · fi(s,a)   (approximate Q-values: update each weight)
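A sketch of this weight update in Python, reusing features and q_value from the previous sketch; alpha, gamma, and the transition layout (s, a, s′, r) match the tabular version above.

```python
def update_weights(weights, transition, actions, alpha=0.1, gamma=0.9):
    """One approximate Q-learning update on the weight vector (stored as a dict)."""
    s, a, s_next, r = transition
    # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
    best_next = max(q_value(weights, s_next, a2) for a2 in actions) if actions else 0.0
    difference = (r + gamma * best_next) - q_value(weights, s, a)
    # wi <- wi + alpha * difference * fi(s,a): each weight moves in proportion to its feature
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```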