Reinforcement Learning Notes

UC Berkeley’s CS285 (for now)

chapter 1: Imitation Learning / behavioral cloning

the supervised learning outlier in RL: the policy is trained directly on expert data
$a_t$ : action
$s_t$ : state
$o_t$ : observation
the state is the complete physical description of the world
an observation is a snapshot / image of the state
transitioning from one state to the next, $s_t \rightarrow s_{t+1}$, satisfies the Markov property: $s_{t+1}$ depends only on $s_t$ and $a_t$, not on earlier states
a perfect training data distribution can be harmful for imitation learning: when the expert demonstrations follow very narrow, deterministic paths, the learned policy easily drifts off-track and the data contains nothing about how to recover. a more diluted distribution, where the expert makes mistakes and corrects them across many diverse states, works better: the mistakes are uncorrelated with the state while the corrective actions stay correlated with it, so the policy learns near-optimal behaviour plus recovery
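
As a concrete reference point, here is a minimal behavioral-cloning sketch in PyTorch, assuming a continuous action space and a plain MSE regression loss; the tensors, dataset, and network sizes are illustrative placeholders, not CS285 code.

```python
import torch
import torch.nn as nn

# Behavioral cloning: treat the expert dataset of (observation, action)
# pairs as a supervised regression problem. Sizes below are illustrative.
obs_dim, act_dim = 17, 6
policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),          # deterministic continuous action head
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

expert_obs = torch.randn(1024, obs_dim)      # stand-in for expert observations o_t
expert_actions = torch.randn(1024, act_dim)  # stand-in for expert actions a_t

for step in range(1000):
    idx = torch.randint(0, expert_obs.shape[0], (64,))   # random minibatch
    pred = policy(expert_obs[idx])
    loss = ((pred - expert_actions[idx]) ** 2).mean()    # MSE to expert action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```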

continuous action distribution policy

mixture of gaussians (a code sketch follows this list)
$\pi_\theta(a|o) = \sum_i w_i \, \mathcal{N}(\mu_i, \sigma_i)$
latent variable models
using conditional variational autoencoders
diffusion models
diffusing on the latent action representation dimension and not the temporal observation dimension
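
A minimal sketch of a mixture-of-Gaussians policy head matching the formula above, assuming a PyTorch setup; the component count, layer sizes, and diagonal covariances are assumptions for illustration.

```python
import torch.nn as nn
from torch.distributions import Categorical, Normal, Independent, MixtureSameFamily

class MixtureGaussianPolicy(nn.Module):
    """pi_theta(a|o) = sum_i w_i N(mu_i, sigma_i); sizes are illustrative."""
    def __init__(self, obs_dim, act_dim, n_components=5):
        super().__init__()
        self.n, self.act_dim = n_components, act_dim
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.logits = nn.Linear(256, n_components)              # mixture weights w_i
        self.means = nn.Linear(256, n_components * act_dim)     # means mu_i
        self.log_stds = nn.Linear(256, n_components * act_dim)  # log sigma_i

    def forward(self, obs):
        h = self.trunk(obs)
        mix = Categorical(logits=self.logits(h))
        means = self.means(h).view(-1, self.n, self.act_dim)
        stds = self.log_stds(h).view(-1, self.n, self.act_dim).exp()
        comp = Independent(Normal(means, stds), 1)  # diagonal Gaussian per component
        return MixtureSameFamily(mix, comp)

# Training maximizes the likelihood of expert actions under the mixture:
# loss = -policy(obs_batch).log_prob(expert_action_batch).mean()
```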

discretization with high dimensional action spaces

autoregressive discretization
generating action vector elements as a sequence
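
A hedged sketch of one way autoregressive discretization can look: each action dimension is binned, and dimension d is predicted by a categorical head conditioned on the observation features and the previously sampled dimensions; the bin count, feature sizes, and conditioning scheme are assumptions, not the specific architecture from the lecture.

```python
import torch
import torch.nn as nn

class AutoregressiveDiscretePolicy(nn.Module):
    """Discretize each action dimension into bins and predict them one at a
    time, conditioning on the dimensions already chosen. Sizes are illustrative."""
    def __init__(self, obs_dim, act_dim, n_bins=51):
        super().__init__()
        self.act_dim, self.n_bins = act_dim, n_bins
        self.obs_enc = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        # One categorical head per dimension; head d sees the observation
        # features plus the d previously sampled (binned) dimensions.
        self.heads = nn.ModuleList(
            nn.Linear(256 + d, n_bins) for d in range(act_dim)
        )

    def sample(self, obs):
        h = self.obs_enc(obs)
        prev = torch.zeros(obs.shape[0], 0, device=obs.device)
        bins = []
        for d in range(self.act_dim):
            logits = self.heads[d](torch.cat([h, prev], dim=-1))
            b = torch.distributions.Categorical(logits=logits).sample()
            bins.append(b)
            # feed the chosen bin (normalized to [0, 1]) into the next head
            prev = torch.cat(
                [prev, (b.float() / (self.n_bins - 1)).unsqueeze(-1)], dim=-1
            )
        return torch.stack(bins, dim=-1)  # integer bin indices, shape [batch, act_dim]
```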

limitations & bottlenecks

non-Markovian behaviour
multimodal behaviour: the expert takes different actions given the same observation
→ more expressive continuous distributions
vs
→ discretization with high dimensional action spaces