Description:

  • The agent uses the feedback it receives to iteratively update its policy while learning, eventually arriving at the optimal policy after sufficient exploration.
  • It estimates the values of states, or the Q-values of state-action pairs, directly, without ever using memory to construct a model of the rewards and transitions in the MDP.
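
As a rough sketch of both points, the tabular Q-learning loop below updates Q-value estimates from sampled feedback alone and never stores a reward or transition model. The three-state chain environment, the function names (`step`, `epsilon_greedy`, `q_update`, `train`), and the values chosen for alpha, gamma, and epsilon are illustrative assumptions, not something these notes specify.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 2  # hypothetical terminal state of a 3-state chain: 0 - 1 - 2


def step(state, action):
    """Toy environment (an assumption): reward 1 only on reaching GOAL."""
    next_state = state + 1 if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily on q."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Blend the old estimate with the sample r + gamma * max_a' Q(s', a')."""
    if next_state == GOAL:
        best_next = 0.0  # no future reward after the terminal state
    else:
        best_next = max(q[(next_state, a)] for a in ACTIONS)
    sample = reward + gamma * best_next
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * sample


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # The Q-table is the only thing learned: no reward or transition model.
    q = defaultdict(float)
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward = step(state, action)
            q_update(q, state, action, reward, next_state, alpha, gamma)
            state = next_state
    return q
```

Calling `q = train()` and taking `max(ACTIONS, key=lambda a: q[(s, a)])` for each non-terminal state `s` reads off the greedy policy; in this toy chain it should pick "right" in both states. The epsilon-greedy step is one simple way to get the exploration the first bullet relies on, and other exploration schemes would fit the same loop.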

Value Learning

Q-learning