Learning Reinforcement Learning

UC Berkeley’s CS285, ETH Zurich’s Robot Learning course, and beyond

Imitation Learning / behavioral cloning

the supervised outlier in RL
$a_t$: action
$s_t$: state
$o_t$: observation
the state is the complete physical description of the world
an observation is a snapshot / image of the state (it may not reveal the full state)
transitions from one state to the next, $s_t \rightarrow s_{t+1}$, satisfy the Markov property: $s_{t+1}$ depends only on $s_t$ and $a_t$, not on earlier states
a perfect training data distribution may be harmful for imitation learning: a policy trained only on very narrow, near-deterministic expert paths can easily go off-track and has no data showing how to recover. In contrast, a diluted training distribution that contains mistakes and the expert's corrections across many diverse states is more robust: the mistakes are largely uncorrelated with the state, while the optimal (corrective) action stays correlated with the state, so the policy picks up the corrections rather than the mistakes

continuous action distribution policy

mixture of gaussians
$\pi_\theta(a|o) = \sum_i w_i \, \mathcal{N}(\mu_i, \sigma_i)$
latent variable models
using conditional variational autoencoders
diffusion models
diffusing on the latent action representation dimension and not the temporal observation dimension
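A minimal sketch of the mixture-of-Gaussians policy above for a 1-D action, assuming $\sigma_i$ is a standard deviation. The weights, means, and the "steer left / steer right" scenario are made-up values standing in for what a network would output for one observation.

```python
import math
import random

def gaussian_pdf(a, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma) at action a (sigma = std dev)."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(a, weights, mus, sigmas):
    """pi(a|o) = sum_i w_i N(a; mu_i, sigma_i); the weights must sum to 1."""
    return sum(w * gaussian_pdf(a, m, s) for w, m, s in zip(weights, mus, sigmas))

def sample_mixture(weights, mus, sigmas, rng):
    """Pick a component according to its weight, then sample from that Gaussian."""
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(mus[i], sigmas[i])

# Hypothetical output for one observation: two equally good modes.
weights, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1]
rng = random.Random(0)
actions = [sample_mixture(weights, mus, sigmas, rng) for _ in range(1000)]
```

A single Gaussian fitted to the same data would put its mean near 0, between the two modes, which is exactly the action the mixture assigns almost no density to.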

discretization with high dimensional action spaces

autoregressive discretization
generating action vector elements as a sequence
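A rough sketch of generating the action vector one element at a time. With $d$ action dimensions and $B$ bins each, a joint discretization needs $B^d$ outputs, while the autoregressive version needs $d$ steps of $B$ outputs. `bin_probs` is a hypothetical stand-in for a learned conditional model, and `NUM_BINS` and the $[-1, 1]$ action range are assumptions.

```python
import random

NUM_BINS = 5  # bins per action dimension (assumed)

def bin_probs(observation, previous_bins):
    """Hypothetical stand-in for a learned model p(a_k | o, a_1..a_{k-1}).

    Returns a distribution over NUM_BINS bins for the next action dimension.
    Here it simply favours the bin chosen for the previous dimension.
    """
    probs = [1.0] * NUM_BINS
    if previous_bins:
        probs[previous_bins[-1]] += 5.0
    total = sum(probs)
    return [p / total for p in probs]

def sample_action(observation, action_dim, rng):
    """Sample the action vector one dimension at a time."""
    bins = []
    for _ in range(action_dim):
        probs = bin_probs(observation, bins)
        bins.append(rng.choices(range(NUM_BINS), weights=probs, k=1)[0])
    return bins

def bins_to_continuous(bins, low=-1.0, high=1.0):
    """Map each bin index to the centre of its interval in [low, high]."""
    width = (high - low) / NUM_BINS
    return [low + (b + 0.5) * width for b in bins]

rng = random.Random(0)
bins = sample_action(observation=None, action_dim=4, rng=rng)
action = bins_to_continuous(bins)
```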

limitations & bottlenecks

non-markovian behaviour
multimodal behaviour: observing different actions based on the same observation
→ more expressive continuous distributions
vs
→ discretization with high dimensional action spaces

Actor Critic

value function estimator: train on $\left\{ s_{i,t}, \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right\}_{i \in D}$
reward bootstrap estimator: train on $\left\{ s_{i,t}, r(s_{i,t}, a_{i,t}) + \hat{\mathrm{V}}^\pi_\phi(s_{i,t+1}) \right\}_{i \in D}$
actor critic algorithm:
1. sample $\{s_i, a_i\}$ using $\pi_\theta(a|s)$
2. fit $\hat{\mathrm{V}}^\pi_\phi$ to the sampled reward sums / bootstrapped targets
3. evaluate $\hat{\mathrm{A}}^\pi_\phi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{\mathrm{V}}^\pi_\phi(s'_i) - \hat{\mathrm{V}}^\pi_\phi(s_i)$
4. $\nabla_\theta \mathcal{J}(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta(a_i|s_i) \, \hat{\mathrm{A}}^\pi_\phi(s_i, a_i)$
5. $\theta \leftarrow \theta + \alpha \, \nabla_\theta \mathcal{J}(\theta)$
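The update above can be sketched for a tiny tabular problem. This is an illustration under assumptions, not a reference implementation: the policy is a softmax over logits `theta[s][a]`, the critic is a table `V[s]` fitted with a discounted bootstrapped target, sampling is assumed to have happened already (the batch is passed in), and the `done` flag, step sizes, and the two-transition batch are made up.

```python
import math

GAMMA, ALPHA, BETA = 0.9, 0.1, 0.5  # discount, actor step, critic step (assumed)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_update(theta, V, batch):
    """One batch update; batch holds (s, a, r, s_next, done) sampled with pi_theta."""
    # Fit the critic towards the bootstrapped target r + gamma * V(s').
    for s, a, r, s_next, done in batch:
        target = r + (0.0 if done else GAMMA * V[s_next])
        V[s] += BETA * (target - V[s])
    # Evaluate the advantage A(s, a) = r + gamma * V(s') - V(s).
    advantages = [r + (0.0 if done else GAMMA * V[s_next]) - V[s]
                  for s, a, r, s_next, done in batch]
    # Gradient estimate: (1/N) * sum_i grad log pi(a_i|s_i) * A_i.
    grad = [[0.0] * len(row) for row in theta]
    for (s, a, _, _, _), adv in zip(batch, advantages):
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # d log softmax(theta[s])[a] / d theta[s][b] = 1[a == b] - pi(b|s)
            grad[s][b] += ((1.0 if a == b else 0.0) - probs[b]) * adv / len(batch)
    # Gradient ascent on the policy parameters.
    for s in range(len(theta)):
        for b in range(len(theta[s])):
            theta[s][b] += ALPHA * grad[s][b]
    return advantages

theta = [[0.0, 0.0]]  # one state, two actions
V = [0.0]
batch = [(0, 0, 1.0, 0, True), (0, 1, 0.0, 0, True)]  # action 0 is rewarded
advantages = actor_critic_update(theta, V, batch)
```

After this single update the logit of the rewarded action rises and the other falls, because the first transition has a positive advantage and the second a negative one.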

Discount factor

applying an exponential decay over time to rewards
policy-gradient: $\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a_{t'})$
actor-critic: $\hat{\mathrm{A}}^\pi_\phi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{\mathrm{V}}^\pi_\phi(s'_i) - \hat{\mathrm{V}}^\pi_\phi(s_i)$
equation (1) (which matches the critic) applies the discount factor starting from the current action step
equation (2) applies the discount factor starting from the initial step
equation (3) is the same as equation (2), rewritten so that the discount also scales the gradient (decaying the importance of future actions / steps)
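The three numbered equations are not reproduced in these notes. Assuming they follow the usual CS285 presentation of discounted policy gradients, they could be written as below; the numbering and the per-sample indices $i, t$ are a reconstruction, not taken from the notes.

```latex
\begin{align}
\nabla_\theta \mathcal{J}(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
  \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})
  \left( \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{i,t'}, a_{i,t'}) \right) \tag{1} \\
\nabla_\theta \mathcal{J}(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
  \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})
  \left( \sum_{t'=t}^{T} \gamma^{t'-1} \, r(s_{i,t'}, a_{i,t'}) \right) \tag{2} \\
\nabla_\theta \mathcal{J}(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
  \gamma^{t-1} \, \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})
  \left( \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{i,t'}, a_{i,t'}) \right) \tag{3}
\end{align}
```

(2) and (3) are equal because $\gamma^{t'-1} = \gamma^{t-1} \gamma^{t'-t}$; pulling $\gamma^{t-1}$ out of the inner sum makes the down-weighting of later steps explicit.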

bias-variance: grad weighting

in policy-gradient algorithms the gradient is a sum over the state-conditioned actions taken along a batch of trajectories sampled with the policy, with each action's log-prob gradient weighted by some value (could be reward-to-go, estimated value function, advantage, discounted advantage, …)
the weight determines how much more / less likely that state-conditioned action becomes after the update
baseline (Monte Carlo): using sampled rewards as the weight is compute-expensive (every sample requires interacting with the environment), which is usually worked around by using fewer samples (commonly a single one) ⇒ leads to a variance problem (too few samples)
critic: using a learned / estimated weight (fitted function, actor-critic, …) makes reducing variance easier (evaluating a learned function is much cheaper than sampling rollouts) ⇒ but introduces bias (estimator error) into the gradient estimate
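To make the two weighting options concrete, here is a small sketch that computes both for one trajectory. The rewards, the critic predictions in `values`, and the discount are made-up numbers.

```python
GAMMA = 0.9  # discount factor (assumed)

def reward_to_go(rewards, gamma=GAMMA):
    """Monte Carlo weight: discounted sum of the rewards actually observed from step t on."""
    weights, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        weights.append(running)
    return weights[::-1]

def critic_advantages(rewards, values, gamma=GAMMA):
    """Critic weight: r_t + gamma * V(s_{t+1}) - V(s_t).

    values has one more entry than rewards; the last entry is V of the final
    state (0.0 here because the episode is assumed to have terminated).
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

rewards = [0.0, 0.0, 1.0]
values = [0.7, 0.8, 0.9, 0.0]  # hypothetical critic predictions V(s_0..s_3)
mc_weights = reward_to_go(rewards)             # depends only on sampled rewards
td_weights = critic_advantages(rewards, values)  # depends on the critic's accuracy
```

The Monte Carlo weights change whenever the sampled rewards change (variance), while the critic-based weights are only as good as `values` (bias).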

best-of-both-worlds solutions

critic as a baseline:
n-step returns:
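The notes leave n-step returns unexplained. As a sketch of the usual definition: sum $n$ real rewards, then bootstrap from the critic for the rest. The trajectory, critic values, and discount below are made up.

```python
def n_step_advantage(rewards, values, t, n, gamma=0.9):
    """A_n(s_t) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t).

    values has len(rewards) + 1 entries; n is clipped at the end of the trajectory.
    """
    end = min(t + n, len(rewards))
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return ret + gamma ** (end - t) * values[end] - values[t]

rewards = [0.0, 0.0, 1.0]
values = [0.7, 0.8, 0.9, 0.0]  # hypothetical critic predictions, terminal value 0

one_step = n_step_advantage(rewards, values, t=0, n=1)   # pure bootstrap
two_step = n_step_advantage(rewards, values, t=0, n=2)
full = n_step_advantage(rewards, values, t=0, n=3)       # pure Monte Carlo minus V(s_0)
```

Small `n` leans on the critic (lower variance, more bias); large `n` leans on sampled rewards (less bias, more variance).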

RL algorithms taxonomy

policy-gradient methods: gradient-based optimization of the expected reward over policy-sampled trajectories
value-based methods: estimate / learn V- or Q- function
actor critic: learn V- or Q- function and use it to optimize the policy (e.g. a better $\nabla_\theta$ estimate)
model-based methods: learn transition model + improve the policy

Policy Gradient

sequence modeling, !! optimizing the expected reward !!
Q-function, value-function and advantage
policy-gradient gradient derivation
effectively:
sample trajectories following the current policy
up-/down-weight the log-prob of each trajectory by its total reward
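The two steps above can be sketched on a one-step (bandit) problem, where each "trajectory" is a single action. The softmax policy, the reward function, the batch size, and the learning rate are all assumptions chosen for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(theta, reward_fn, rng, batch_size=32, lr=0.1):
    """One policy-gradient step; theta are the logits of a softmax policy."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(batch_size):
        # Sample a trajectory (here: one action) by following the current policy.
        a = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        r = reward_fn(a)
        # Weight grad log pi(a) = onehot(a) - probs by the trajectory's reward.
        for b in range(len(theta)):
            grad[b] += ((1.0 if a == b else 0.0) - probs[b]) * r / batch_size
    return [t + lr * g for t, g in zip(theta, grad)]

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    theta = reinforce_step(theta, reward_fn=lambda a: 1.0 if a == 1 else 0.0, rng=rng)
```

Only the rewarded action contributes a non-zero weight here, so its log-prob is pushed up and the policy drifts towards action 1.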

Off-Policy Policy Gradient

updating the current policy using actions / transitions collected by previous policies, using importance sampling
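A sketch of the importance weight that lets old samples stand in for the current policy. The two policies, the trajectories, and their returns are made up; the function names are illustrative only. It estimates the objective rather than its gradient, to keep the reweighting visible.

```python
import math

def importance_weight(trajectory, new_policy, old_policy):
    """Ratio prod_t pi_new(a_t|s_t) / pi_old(a_t|s_t) for one trajectory.

    Only policy terms appear: the initial-state and transition probabilities
    are identical under both policies and cancel in the ratio.
    """
    log_ratio = sum(math.log(new_policy(s, a)) - math.log(old_policy(s, a))
                    for s, a in trajectory)
    return math.exp(log_ratio)

def off_policy_objective(trajectories, returns, new_policy, old_policy):
    """Estimate the new policy's expected return from old-policy trajectories."""
    weights = [importance_weight(tr, new_policy, old_policy) for tr in trajectories]
    return sum(w * g for w, g in zip(weights, returns)) / len(trajectories)

def old_policy(s, a):
    return 0.5  # uniform over two actions

def new_policy(s, a):
    return 0.8 if a == 0 else 0.2

trajectories = [[(0, 0), (0, 0)], [(0, 1)]]  # lists of (state, action) pairs
returns = [1.0, 2.0]
w = importance_weight(trajectories[0], new_policy, old_policy)
estimate = off_policy_objective(trajectories, returns, new_policy, old_policy)
```

Because the weight is a product over time steps, it can grow or shrink exponentially with trajectory length, which is the usual practical difficulty with this estimator.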