UC Berkeley’s CS285 (for now)
chapter 1: Imitation Learning / behavioral cloning
the supervised outlier in RL
a_t: action
s_t: state
o_t: observation
the state is the complete physical configuration of the world
an observation is a snapshot / image of the state, which may reveal it only partially
state transitions satisfy the Markov property: the next state depends only on the current state and action, p(s_{t+1} | s_t, a_t), not on earlier history
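behavioral cloning is just supervised learning on expert (o_t, a_t) pairs. a minimal sketch, assuming the expert data is already given as tensors (the MLP sizes and the random tensors standing in for real demonstrations are made up):

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 16, 4  # hypothetical dimensions
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# stand-in expert dataset: (o_t, a_t) pairs; in practice these come from demonstrations
obs = torch.randn(1024, obs_dim)
acts = torch.randn(1024, act_dim)

for epoch in range(100):
    pred = policy(obs)                  # pi_theta(o_t)
    loss = ((pred - acts) ** 2).mean()  # plain supervised regression on expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()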
a perfect training data distribution can be harmful for imitation learning: a policy trained only on narrow, deterministic expert trajectories easily drifts off-track into states with no training signal. a diluted training distribution, where the expert makes mistakes and corrects them across many diverse states, works better. if the mistakes are injected independently of the state while the corrective (optimal) actions remain correlated with the state, the mistakes average out as noise and the policy still learns how to recover (a collection sketch follows).
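a sketch of that collection scheme, assuming a gymnasium-style env API; `expert` is a hypothetical callable returning the optimal action for an observation:

```python
import numpy as np

def collect_noisy_demos(env, expert, episodes=10, noise_std=0.1):
    dataset = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            a_expert = expert(obs)  # optimal action, correlated with the state
            # state-independent noise pushes the rollout into off-track states
            a_noisy = a_expert + np.random.normal(0, noise_std, size=a_expert.shape)
            # label the visited state with the expert's correction, not the noisy action
            dataset.append((obs, a_expert))
            obs, _, terminated, truncated, _ = env.step(a_noisy)
            done = terminated or truncated
    return dataset
```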
continuous action distribution policies
mixture of Gaussians
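a sketch of a mixture-of-Gaussians policy head (the number of components and all layer sizes are arbitrary); training minimizes the negative log-likelihood of the expert action under the mixture:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MoGPolicy(nn.Module):
    def __init__(self, obs_dim=16, act_dim=4, k=5):
        super().__init__()
        self.k, self.act_dim = k, act_dim
        # outputs k mixture logits, k means, and k log-stds per observation
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, k * (2 * act_dim + 1)))

    def dist(self, obs):
        out = self.net(obs)
        logits = out[..., :self.k]
        means = out[..., self.k:self.k + self.k * self.act_dim].reshape(-1, self.k, self.act_dim)
        log_std = out[..., self.k + self.k * self.act_dim:].reshape(-1, self.k, self.act_dim)
        comp = D.Independent(D.Normal(means, log_std.exp()), 1)
        return D.MixtureSameFamily(D.Categorical(logits=logits), comp)

policy = MoGPolicy()
obs, act = torch.randn(32, 16), torch.randn(32, 4)
loss = -policy.dist(obs).log_prob(act).mean()
```

different mixture components can place probability mass on distinct expert modes, which a single Gaussian would average into a bad in-between action.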
latent variable models
using conditional variational autoencoders
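a minimal conditional VAE sketch (all sizes arbitrary): the encoder q(z | o, a) is used only during training; at test time actions are decoded from z ~ N(0, I), so the same observation can decode to different modes:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, z_dim = 16, 4, 8

encoder = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                        nn.Linear(64, 2 * z_dim))  # mean and log-variance of z
decoder = nn.Sequential(nn.Linear(obs_dim + z_dim, 64), nn.ReLU(),
                        nn.Linear(64, act_dim))

def cvae_loss(obs, act):
    mu, log_var = encoder(torch.cat([obs, act], dim=-1)).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization trick
    recon = decoder(torch.cat([obs, z], dim=-1))
    recon_loss = ((recon - act) ** 2).mean()
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
    return recon_loss + 1e-3 * kl  # KL weight is a made-up hyperparameter

obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
loss = cvae_loss(obs, act)
# test time: a = decoder(cat(o, z)) with z ~ N(0, I)
```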
diffusion models
the diffusion (noising / denoising) runs over the action representation, not over the temporal observation dimension
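a DDPM-style training-step sketch for actions; the linear noise schedule and the crude scalar timestep feature are simplifying assumptions, and all sizes are hypothetical. the denoiser predicts the noise that was added to the expert action, conditioned on the observation and the timestep:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, T = 16, 4, 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(obs_dim + act_dim + 1, 128), nn.ReLU(),
                         nn.Linear(128, act_dim))

def diffusion_loss(obs, act):
    t = torch.randint(0, T, (obs.shape[0],))
    ab = alphas_bar[t].unsqueeze(-1)
    eps = torch.randn_like(act)
    noisy_act = ab.sqrt() * act + (1 - ab).sqrt() * eps  # forward-noise the action
    t_feat = (t.float() / T).unsqueeze(-1)               # crude timestep embedding
    eps_pred = denoiser(torch.cat([obs, noisy_act, t_feat], dim=-1))
    return ((eps_pred - eps) ** 2).mean()

obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
loss = diffusion_loss(obs, act)
```

sampling runs the reverse process: start from pure noise in action space and denoise step by step, always conditioned on the same observation.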
discretization with high-dimensional action spaces
autoregressive discretization
generating the action vector elements as a sequence, one dimension at a time, each conditioned on the previously generated dimensions (so the cost grows linearly in the action dimension instead of exponentially, as naive joint binning would)
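a sketch of the autoregressive sampler (bin count and sizes arbitrary); training would teacher-force each head with a cross-entropy loss on the expert's bin indices:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_bins = 16, 4, 21

# one small head per action dimension; head i sees obs + the i previous bin indices
heads = nn.ModuleList(
    nn.Sequential(nn.Linear(obs_dim + i, 64), nn.ReLU(), nn.Linear(64, n_bins))
    for i in range(act_dim)
)

def sample_action(obs):
    chosen = []
    for i, head in enumerate(heads):
        prev = torch.stack(chosen, dim=-1).float() if chosen else obs.new_zeros(obs.shape[0], 0)
        logits = head(torch.cat([obs, prev], dim=-1))
        chosen.append(torch.distributions.Categorical(logits=logits).sample())
    bins = torch.stack(chosen, dim=-1)          # integer bin index per dimension
    return bins.float() / (n_bins - 1) * 2 - 1  # map bins back to [-1, 1]

obs = torch.randn(32, obs_dim)
action = sample_action(obs)
```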
limitations & bottlenecks
non-Markovian behaviour: the expert's action may depend on the whole history of observations, not just the current one (a history-conditioned policy sketch follows this list)
multimodal behaviour: observing different expert actions for the same observation
→ more expressive continuous distributions
vs
→ discretization with high-dimensional action spaces
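the history-conditioned policy sketch mentioned above, one way to address non-Markovian behaviour: condition on a sequence of observations with an LSTM instead of a single frame (all sizes hypothetical):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, hidden = 16, 4, 64

class RecurrentPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):  # obs_seq: (batch, time, obs_dim)
        h, _ = self.rnn(obs_seq)
        return self.head(h)      # one action prediction per timestep

policy = RecurrentPolicy()
obs_seq = torch.randn(8, 20, obs_dim)  # 8 trajectories, 20 steps each
act_pred = policy(obs_seq)             # (8, 20, act_dim)
```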