Description:
- Maximize the expected sum of rewards; the process is unpredictable, so the agent must learn from experience by interacting with the environment
- Online learning
- Receive feedback in the form of rewards
- Agent’s utility is defined by the reward function
- Must (learn to) act so as to maximize expected rewards
- All learning is based on observed samples of outcomes
- We don’t know the true transition probabilities or the reward function
Episode:
- A set of samples (state, action, reward, next state) collected while interacting with the environment (see the sketch below)
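A minimal sketch of what such a set of samples looks like in practice. The toy SimpleChainEnv class, its reset/step methods, and the random behaviour policy are illustrative assumptions, not part of the notes: the agent acts, observes rewards, and records (s, a, r, s′) tuples without ever seeing the true transition or reward model.

```python
import random

class SimpleChainEnv:
    """Hypothetical toy chain: states 0..4, reward 1 only on reaching state 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4                   # state 4 acts as a terminal state
        return self.state, reward, done

def collect_episode(env, max_steps=50):
    """One episode: a list of observed (state, action, reward, next_state) samples."""
    samples = []
    state = env.reset()
    for _ in range(max_steps):
        action = random.choice([-1, +1])         # random behaviour policy
        next_state, reward, done = env.step(action)
        samples.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return samples

print(collect_episode(SimpleChainEnv()))
```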
Utilities of sequences:
- A reward is given to the agent at each step
- The rewards in the sequence should grow as the agent gets closer to the goal (similar to a heuristic)
- Rewards should be given early on rather than only at the end; otherwise the agent has no incentive to move in the right direction
Discounting:
- Rewards now are preferred over rewards later
- Discounting also helps the algorithms converge
- A reward received n steps in the future is multiplied by a discount factor γ^n (see the sketch below)
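A small sketch of the computation (γ = 0.9 and the reward sequences are made-up numbers): a reward received t steps in the future contributes γ^t · r to the utility, so the same reward is worth more when it arrives earlier.

```python
def discounted_return(rewards, gamma):
    """U([r_0, r_1, ...]) = sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
print(discounted_return([1, 0, 0], gamma))   # reward arrives early -> 1.0
print(discounted_return([0, 0, 1], gamma))   # same reward arrives late -> 0.81
```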
Solution to infinite utilities:
- If the game can last forever, we can’t use the plain sum of rewards: the agent can keep accumulating reward forever, so the sum becomes infinite
- Finite horizon:
- We terminate episodes after a fixed number of steps T (e.g. a lifetime)
- Similar to depth-limited search
- Gives non-stationary policies π (the optimal π depends on the time left); see the value-iteration sketch after this list
- Discounting:
- U([r_0, r_1, r_2, …]) = ∑_{t=0}^{∞} γ^t r_t ≤ R_max / (1 − γ)
- A smaller γ means a shorter effective “horizon”: short-term focus (see the numeric check after this list)
- Absorbing state:
- Guarantees that for every policy, a terminal state will eventually be reached
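For the finite-horizon bullet above, a value-iteration sketch on a made-up deterministic chain (the states, rewards, and horizon are illustrative assumptions): with few steps left, the optimal action from the start state grabs a nearby small reward, while with more steps left it heads for a distant larger one, so the optimal policy depends on the time remaining.

```python
STATES = [-1, 0, 1, 2, 3]
TERMINAL = {-1: 1.0, 3: 10.0}          # reward received on entering these states

def move(s, a):                        # deterministic step, clipped to the chain
    return max(-1, min(3, s + a))

def finite_horizon_policy(T):
    V = {s: 0.0 for s in STATES}       # V_0: no steps left, nothing to collect
    policy = {}                        # policy[(k, s)] = best action with k steps left
    for k in range(1, T + 1):
        V_new = {}
        for s in STATES:
            if s in TERMINAL:          # absorbing states: no further reward
                V_new[s] = 0.0
                continue
            best_a, best_q = None, float("-inf")
            for a in (-1, +1):
                s2 = move(s, a)
                q = TERMINAL.get(s2, 0.0) + V[s2]
                if q > best_q:
                    best_a, best_q = a, q
            V_new[s], policy[(k, s)] = best_q, best_a
        V = V_new
    return policy

policy = finite_horizon_policy(4)
print(policy[(1, 0)])   # 1 step left  -> -1 (take the nearby reward of 1)
print(policy[(4, 0)])   # 4 steps left -> +1 (head for the distant reward of 10)
```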
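For the discounting bound above, a quick numeric check (the γ and R_max values are made-up examples): even an endless stream of maximal rewards has a finite discounted utility, and a truncated sum approaches R_max / (1 − γ).

```python
gamma, r_max = 0.9, 1.0
bound = r_max / (1 - gamma)                          # 10.0
approx = sum((gamma ** t) * r_max for t in range(1000))
print(bound, approx)                                 # 10.0  9.999...

# A smaller gamma shrinks the bound, i.e. a shorter effective horizon.
print(r_max / (1 - 0.5))                             # 2.0 for gamma = 0.5
```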