Description:
- The agent uses the feedback (rewards) it receives to iteratively update its policy while learning, eventually converging to the optimal policy after sufficient exploration.
- Estimates the values (or Q-values) of states directly from experience, without ever using memory to build an explicit model of the MDP's rewards and transition probabilities (i.e., model-free learning).
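The points above can be sketched with tabular Q-learning, a standard model-free method. The environment here (a 5-state chain with a rewarding terminal state) and all names are hypothetical illustrations, not from the source; note that the agent stores only a Q-table, never the transition or reward model.

```python
import random

# Hypothetical 5-state chain MDP: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 gives reward 1 and ends the episode.
N_STATES = 5
ACTIONS = [0, 1]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # step size, discount, exploration rate

def step(state, action):
    """Deterministic environment transition (hidden from the agent)."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

def q_learning(episodes=500, seed=0):
    random.seed(seed)
    # The only thing the agent remembers: a table of Q-values,
    # not a model of transitions or rewards.
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy exploration of the action space.
            if random.random() < EPSILON:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda a_: q[(s, a_)])
            s2, r, done = step(s, a)
            # Q-learning update from the observed sample (s, a, r, s2).
            best_next = max(q[(s2, a_)] for a_ in ACTIONS)
            q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])
            s = s2
    return q

q = q_learning()
# Greedy policy extracted from the learned Q-values for non-terminal states.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After enough episodes, the greedy policy moves right in every non-terminal state, showing how the feedback alone (rewards) is enough to recover the optimal policy without a model.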