Description:
- Maximize the expected sum of rewards; the process is unpredictable, so the agent must learn from experience by interacting with the environment
- Online learning
- Receive feedback in the form of rewards
- Agent’s utility is defined by the reward function
- Must (learn to) act so as to maximize expected rewards
- All learning is based on observed samples of outcomes
- We don’t know the true transition probabilities or the reward function
Episode:
- A set of samples (state, action, reward, next state) collected while interacting with the environment (see the sketch below)
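A minimal sketch of what such a set of samples looks like in practice. The toy SimpleChainEnv class, its reset/step methods, and the random behaviour policy are illustrative assumptions, not part of the notes: the agent acts, observes rewards, and records (s, a, r, s′) tuples without ever seeing the true transition or reward model.

```python
import random

class SimpleChainEnv:
    """Hypothetical toy chain: states 0..4, reward 1 only on reaching state 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4                   # state 4 acts as a terminal state
        return self.state, reward, done

def collect_episode(env, max_steps=50):
    """One episode: a list of observed (state, action, reward, next_state) samples."""
    samples = []
    state = env.reset()
    for _ in range(max_steps):
        action = random.choice([-1, +1])         # random behaviour policy
        next_state, reward, done = env.step(action)
        samples.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return samples

print(collect_episode(SimpleChainEnv()))
```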
Utilities of sequences:
- A reward is given to the agent at each step
- The rewards in the sequence should grow as the agent gets closer to the goal (similar to a heuristic)
- Rewards should be given early on rather than only at the end; otherwise the agent has no incentive to move in the right direction
Discounting:
- Rewards now are preferred over rewards later
- Discounting also helps the algorithms converge
- A reward received n steps in the future is multiplied by a discount factor γ^n (see the sketch below)
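A small sketch of the computation (γ = 0.9 and the reward sequences are made-up numbers): a reward received t steps in the future contributes γ^t · r to the utility, so the same reward is worth more when it arrives earlier.

```python
def discounted_return(rewards, gamma):
    """U([r_0, r_1, ...]) = sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
print(discounted_return([1, 0, 0], gamma))   # reward arrives early -> 1.0
print(discounted_return([0, 0, 1], gamma))   # same reward arrives late -> 0.81
```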
Solution to infinite utilities:
- If the game can last forever, we can’t use the plain sum of rewards: the agent can keep accumulating reward forever, so the sum becomes infinite
- Finite horizon:
- We terminate episodes after a fixed number of steps T (e.g. a lifetime)
- Similar to depth-limited search
- Gives non-stationary policies π (the optimal π depends on the time left); see the value-iteration sketch after this list
- Discounting:
- U([r_0, r_1, r_2, …]) = ∑_{t=0}^{∞} γ^t r_t ≤ R_max / (1 − γ)
- A smaller γ means a shorter effective “horizon”: short-term focus (see the numeric check after this list)
- Absorbing state:
- Guarantees that for every policy, a terminal state will eventually be reached
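For the finite-horizon bullet above, a value-iteration sketch on a made-up deterministic chain (the states, rewards, and horizon are illustrative assumptions): with few steps left, the optimal action from the start state grabs a nearby small reward, while with more steps left it heads for a distant larger one, so the optimal policy depends on the time remaining.

```python
STATES = [-1, 0, 1, 2, 3]
TERMINAL = {-1: 1.0, 3: 10.0}          # reward received on entering these states

def move(s, a):                        # deterministic step, clipped to the chain
    return max(-1, min(3, s + a))

def finite_horizon_policy(T):
    V = {s: 0.0 for s in STATES}       # V_0: no steps left, nothing to collect
    policy = {}                        # policy[(k, s)] = best action with k steps left
    for k in range(1, T + 1):
        V_new = {}
        for s in STATES:
            if s in TERMINAL:          # absorbing states: no further reward
                V_new[s] = 0.0
                continue
            best_a, best_q = None, float("-inf")
            for a in (-1, +1):
                s2 = move(s, a)
                q = TERMINAL.get(s2, 0.0) + V[s2]
                if q > best_q:
                    best_a, best_q = a, q
            V_new[s], policy[(k, s)] = best_q, best_a
        V = V_new
    return policy

policy = finite_horizon_policy(4)
print(policy[(1, 0)])   # 1 step left  -> -1 (take the nearby reward of 1)
print(policy[(4, 0)])   # 4 steps left -> +1 (head for the distant reward of 10)
```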
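For the discounting bound above, a quick numeric check (the γ and R_max values are made-up examples): even an endless stream of maximal rewards has a finite discounted utility, and a truncated sum approaches R_max / (1 − γ).

```python
gamma, r_max = 0.9, 1.0
bound = r_max / (1 - gamma)                          # 10.0
approx = sum((gamma ** t) * r_max for t in range(1000))
print(bound, approx)                                 # 10.0  9.999...

# A smaller gamma shrinks the bound, i.e. a shorter effective horizon.
print(r_max / (1 - 0.5))                             # 2.0 for gamma = 0.5
```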