Learning Reinforcement Learning

UC Berkeley's CS285, ETH Zurich's Robot Learning course, and beyond
Imitation Learning / behavioral cloning
the supervised outlier in RL
$a_t$: action
$s_t$: state
$o_t$: observation
$\pi_\theta(a_t \mid o_t)$: policy
$\pi_\theta(a_t \mid s_t)$: policy (fully observed)
the state is a complete physical description of the world
an observation is a snapshot / image of the state (and may not reveal all of it)
state transitions $s_t \rightarrow s_{t+1}$ satisfy the Markov property: $s_{t+1}$ depends only on $s_t$ and $a_t$
a perfect training data distribution may be harmful for imitation learning: a policy trained only on very narrow, near-deterministic expert paths easily goes off-track and has never seen how to recover. A more diluted training distribution, containing mistakes and the expert's corrections in many diverse states, is more robust: the mistakes are largely uncorrelated with the state, while the optimal (corrective) action stays correlated with the state, so the policy learns the corrections rather than the mistakes
continuous action distribution policy
mixture of gaussians
$\pi_\theta(a \mid o) = \sum_i w_i \, \mathcal{N}(\mu_i, \sigma_i)$
latent variable models
using conditional variational autoencoders
diffusion models
diffusing on the latent action representation dimension and not the temporal observation dimension
discretization with high dimensional action spaces
autoregressive discretization
generating action vector elements as a sequence
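The mixture-of-Gaussians policy listed above can be sketched as a log-likelihood computation for a single 1-D action. This is a minimal illustration only: the function names and the hand-picked `weights`, `mus`, `sigmas` (standing in for network outputs given one observation) are assumptions, not part of the course material.

```python
import math

def gaussian_pdf(a, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) evaluated at a."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_log_prob(a, weights, mus, sigmas):
    """log pi(a | o) = log sum_i w_i N(a; mu_i, sigma_i) for a 1-D action."""
    density = sum(w * gaussian_pdf(a, m, s)
                  for w, m, s in zip(weights, mus, sigmas))
    return math.log(density)

# Hypothetical network outputs for one observation: two equally likely
# modes, e.g. "steer left" around -1 and "steer right" around +1.
weights = [0.5, 0.5]
mus = [-1.0, 1.0]
sigmas = [0.2, 0.2]

# Behavioral-cloning loss is the negative log-likelihood of the expert action.
nll_left = -mixture_log_prob(-1.0, weights, mus, sigmas)
nll_right = -mixture_log_prob(1.0, weights, mus, sigmas)
nll_middle = -mixture_log_prob(0.0, weights, mus, sigmas)
```

Both expert modes get a low loss while the average of the two modes (`a = 0`) gets a high one, which is the behaviour a single Gaussian cannot represent.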
limitations & bottlenecks
non-Markovian behaviour
multimodal behaviour: observing different actions based on the same observation
→ more expressive continuous distributions
              vs
→ discretization with high dimensional action spaces
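To make the discretization option concrete, here is a small sketch of per-dimension binning and of why generating the action one dimension at a time helps. The helper names, the action range and the bin count are illustrative assumptions.

```python
def discretize(action, low, high, n_bins):
    """Map each continuous action dimension to a bin index in [0, n_bins)."""
    idxs = []
    for a in action:
        frac = (a - low) / (high - low)
        # Clamp so that a == high falls into the last bin.
        idxs.append(min(n_bins - 1, max(0, int(frac * n_bins))))
    return idxs

def undiscretize(idxs, low, high, n_bins):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in idxs]

bins = discretize([-1.0, 0.0, 0.99], low=-1.0, high=1.0, n_bins=10)
centres = undiscretize(bins, low=-1.0, high=1.0, n_bins=10)

# Output sizes for a hypothetical 6-dimensional action with 10 bins each:
n_bins, dims = 10, 6
joint_classes = n_bins ** dims          # one softmax over the full grid
autoregressive_outputs = n_bins * dims  # one n_bins-way softmax per dimension
```

A joint softmax grows exponentially with the action dimension, while the autoregressive factorization grows linearly at the cost of sampling the dimensions in sequence.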
Actor Critic
value function estimator: train on $\left\{ \left( s_{i,t}, \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right) \right\}_{i \in D}$
reward bootstrap estimator: train on $\left\{ \left( s_{i,t}, r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1}) \right) \right\}_{i \in D}$
actor critic algorithm:
sample $\{s_i, a_i\}$ using $\pi_\theta(a \mid s)$
fit $\hat{V}^\pi_\phi$ to the sampled reward sums / bootstrapped targets
evaluate $\hat{A}^\pi_\phi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)$
$\nabla_\theta \mathcal{J}(\theta) \approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \, \hat{A}^\pi_\phi(s_i, a_i)$
$\theta \leftarrow \theta + \alpha \, \nabla_\theta \mathcal{J}(\theta)$
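The critic-fitting and advantage-evaluation steps above can be sketched with a tabular critic. Everything named here is an assumption made for illustration: the dictionary critic `V`, the `(s, a, r, s_next, done)` transition layout, the `done` flag that zeroes the bootstrap at terminal states, and the learning rate.

```python
GAMMA = 0.9  # assumed discount factor

def bootstrapped_targets(transitions, V, gamma=GAMMA):
    """y_i = r_i + gamma * V(s'_i); terminal next states contribute 0."""
    return [r + gamma * (0.0 if done else V[s_next])
            for (_s, _a, r, s_next, done) in transitions]

def fit_critic(transitions, V, lr=0.5, gamma=GAMMA):
    """One regression sweep of a tabular critic toward bootstrapped targets."""
    targets = bootstrapped_targets(transitions, V, gamma)
    V = dict(V)  # work on a copy, leave the caller's critic untouched
    for (s, _a, _r, _s_next, _done), y in zip(transitions, targets):
        V[s] += lr * (y - V[s])
    return V

def advantages(transitions, V, gamma=GAMMA):
    """A(s_i, a_i) = r_i + gamma * V(s'_i) - V(s_i)."""
    targets = bootstrapped_targets(transitions, V, gamma)
    return [y - V[s] for (s, _a, _r, _s_next, _done), y in zip(transitions, targets)]

# Two transitions sampled with the current policy on a toy 3-state chain.
transitions = [("s0", "right", 0.0, "s1", False),
               ("s1", "right", 1.0, "s2", True)]
V0 = {"s0": 0.0, "s1": 0.0, "s2": 0.0}

V1 = fit_critic(transitions, V0)
adv = advantages(transitions, V1)
```

The resulting `adv` values are the weights that multiply $\nabla_\theta \log \pi_\theta(a_i \mid s_i)$ in the policy update.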
Discount factor
applying an exponential decay over time to rewards
$$\text{policy gradient:} \quad \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{t'}, a_{t'})$$
$$\text{actor-critic:} \quad \hat{A}^\pi_\phi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)$$
$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{t'}, a_{t'}) \right) \right] \quad (1)$$
$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T} \gamma^{t'-1} \, r(s_{t'}, a_{t'}) \right) \right] \quad (2)$$
$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \gamma^{t-1} \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{t'}, a_{t'}) \right) \right] \quad (3)$$
equation (1) (which matches the critic) applies the discount factor starting from the current action step
equation (2) applies the discount factor starting from the initial step
equation (3) is equation (2) with $\gamma^{t-1}$ factored out of the inner sum: discounting from the initial step also discounts the gradient (decaying the importance of later actions / steps)
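The difference between equations (1) and (2) is easy to see numerically. The sketch below computes both weightings for one reward sequence; the function names are made up for this example and time steps are 0-indexed in the code, so $\gamma^{t-1}$ becomes `gamma ** t`.

```python
def reward_to_go_eq1(rewards, gamma):
    """Equation (1): sum_{t' >= t} gamma^(t' - t) * r_t' for every t.

    The discount restarts at the current step.
    """
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def reward_to_go_eq2(rewards, gamma):
    """Equation (2): sum_{t' >= t} gamma^(t' - 1) * r_t' for every t.

    The discount is counted from the start of the episode, which equals
    gamma^(t - 1) times the equation (1) value (t is 1-indexed in the
    notes, 0-indexed here).
    """
    eq1 = reward_to_go_eq1(rewards, gamma)
    return [gamma ** t * g for t, g in enumerate(eq1)]

rewards = [1.0, 1.0, 1.0]
w1 = reward_to_go_eq1(rewards, gamma=0.5)
w2 = reward_to_go_eq2(rewards, gamma=0.5)
```

With identical rewards at every step, the equation (2) weights shrink much faster for late steps, which is the "later actions matter less" effect described above.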
bias-variance: gradient weighting
in policy-gradient algorithms the gradient is a sum over the state-conditioned actions taken along a batch of trajectories sampled with the policy, with each action's log-probability gradient weighted by some value (reward-to-go, estimated value function, advantage, discounted advantage, …)
$$\nabla_\theta \mathcal{J}(\theta) = \sum_i \sum_t \left( \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \times \mathrm{weight} \right)$$
the weight determines how much more / less likely that state-conditioned action becomes after the update
baseline approach: weighting by rewards from a Monte Carlo estimator is compute-expensive (every sample requires interacting with the environment), which is usually overcome by using fewer samples (commonly a single one)
⇒ leads to a variance problem (too few samples)
critic: using a learned / estimated weight (fitted function, actor critic, …) makes reducing variance easier (evaluating a learned function is much cheaper than sampling the environment)
⇒ introduces bias (estimator error) into the gradient
best-of-both-worlds solutions
critic as a baseline:
$$\mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \left( \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{t'}, a_{t'}) \right) - \hat{V}^\pi_\phi(s_t) \right) \right]$$
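A short worked step, added here as a sketch, for why subtracting a baseline does not bias the gradient. The symbol $b(s_t)$ is a placeholder for any baseline that does not depend on $a_t$, with $\hat{V}^\pi_\phi(s_t)$ as one such choice:

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t) \right]
&= b(s_t) \int \pi_\theta(a_t \mid s_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, da_t \\
&= b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t) \, da_t \\
&= b(s_t) \, \nabla_\theta \int \pi_\theta(a_t \mid s_t) \, da_t
 = b(s_t) \, \nabla_\theta 1 = 0
\end{aligned}
```

So the Monte Carlo reward-to-go keeps the estimator unbiased, and the critic only reduces variance as long as it enters as a baseline.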
 
n-step returns:
$$\begin{aligned}
\hat{A}^\pi_{C}(s_t, a_t) &= r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t) \\
\hat{A}^\pi_{MC}(s_t, a_t) &= \sum_{t'=t}^{\infty} \gamma^{t'-t} \, r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t) \\
\hat{A}^\pi_{n}(s_t, a_t) &= \sum_{t'=t}^{t+n-1} \gamma^{t'-t} \, r(s_{t'}, a_{t'}) + \gamma^{n} \hat{V}^\pi_\phi(s_{t+n}) - \hat{V}^\pi_\phi(s_t) \\
\hat{A}^\pi_{GAE}(s_t, a_t) &= \sum_{n=1}^{\infty} w_n \, \hat{A}^\pi_{n}(s_t, a_t), \quad w_n \propto \lambda^{n-1}
\end{aligned}$$
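The estimators above can be compared on a toy episode. This sketch makes two assumptions that are not in the notes: `values` carries one extra entry for the state after the last reward (0 for a terminal state), and GAE is computed with the commonly used recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$ with $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$, which is the closed form of the $\lambda^{n-1}$-weighted sum of n-step advantages.

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """n-step advantage at step t; n is truncated at the end of the episode.

    values has len(rewards) + 1 entries (assumed layout).
    """
    n = min(n, len(rewards) - t)
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    return ret + gamma ** n * values[t + n] - values[t]

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimation via the backward recursion."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # hypothetical critic outputs, terminal value 0
gamma = 0.5

one_step = [n_step_advantage(rewards, values, t, 1, gamma) for t in range(3)]
monte_carlo = [n_step_advantage(rewards, values, t, 3, gamma) for t in range(3)]
gae_lam0 = gae(rewards, values, gamma, lam=0.0)
gae_lam1 = gae(rewards, values, gamma, lam=1.0)
```

Setting `lam=0` recovers the one-step (critic) advantage and `lam=1` recovers the Monte Carlo advantage with the critic as a baseline, so $\lambda$ interpolates between the low-variance and the low-bias end.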
RL algorithms typology
policy-gradient methods: gradient-based optimization of the expected reward over policy-sampled trajectories
value-based methods: estimate / learn V- or Q-function
actor critic: learn V- or Q-function and use it to optimize the policy (e.g. a better estimate of $\nabla_\theta \mathcal{J}(\theta)$)
model-based methods: learn transition model + improve the policy
Policy Gradient
sequence modeling, !! optimizing the expected reward !!
$$p_\theta(\tau) = p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$
$$\begin{aligned}
\theta^* &= \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] \\
&= \arg\max_\theta \, \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)} \left[ r(s_t, a_t) \right]
\end{aligned}$$
Q-function, value-function and advantage
$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$$
$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)} \left[ Q^\pi(s_t, a_t) \right]$$
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
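A tiny numeric illustration of the three definitions for a single state with two actions. The Q-values and policy probabilities are made-up numbers and the helper name is an assumption.

```python
def value_and_advantage(q_values, policy_probs):
    """V = E_{a ~ pi}[Q(s, a)] and A(s, a) = Q(s, a) - V for one state."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    adv = [q - v for q in q_values]
    return v, adv

q_values = [1.0, 3.0]      # hypothetical Q(s, a0), Q(s, a1)
policy_probs = [0.5, 0.5]  # pi(a0 | s), pi(a1 | s)

v, adv = value_and_advantage(q_values, policy_probs)
# The advantage averages to zero under the policy by construction.
mean_adv = sum(p * a for p, a in zip(policy_probs, adv))
```

The advantage says how much better or worse an action is than the policy's average behaviour in that state, which is why it is a natural weight for the policy gradient.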
policy-gradient gradient derivation
$$\begin{aligned}
\mathcal{J}(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] \\
&= \int p_\theta(\tau) \, r(\tau) \, d\tau \\
\nabla_\theta \mathcal{J}(\theta) &= \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau, \quad \nabla_x f(x) = f(x) \nabla_x \log f(x) \\
&= \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_t r(s_t, a_t) \right) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{Q}_t \right] \quad \text{: causality}
\end{aligned}$$
$$\hat{Q}_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \quad \text{: reward-to-go}$$
the step from $\nabla_\theta \log p_\theta(\tau)$ to $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ holds because the initial state distribution $p(s_1)$ and the dynamics $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$
effectively:
sample trajectories following the current policy
up-/down-weight the log-probability of each trajectory's actions using its sampled return (reward-to-go)
$\theta \leftarrow \theta + \alpha \, \nabla_\theta \mathcal{J}(\theta)$ : REINFORCE algorithm
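The gradient estimate and update above can be sketched for a tabular softmax policy, where $\nabla_{\theta_{s,b}} \log \pi_\theta(a \mid s) = \mathbb{1}[a = b] - \pi_\theta(b \mid s)$. The tabular parameterization, the `(s, a, r)` trajectory layout and the function names are assumptions made to keep the example self-contained; rewards are undiscounted, as in the reward-to-go definition.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_gradient(trajectories, theta):
    """Average over trajectories of sum_t grad log pi(a_t | s_t) * Q_hat_t.

    theta[s][a] are the logits of a tabular softmax policy;
    each trajectory is a list of (s, a, r) tuples.
    """
    grad = [[0.0] * len(row) for row in theta]
    for traj in trajectories:
        # Reward-to-go: Q_hat_t = sum_{t' >= t} r_t'
        q_hat, running = [], 0.0
        for (_s, _a, r) in reversed(traj):
            running += r
            q_hat.append(running)
        q_hat.reverse()
        for (s, a, _r), q in zip(traj, q_hat):
            probs = softmax(theta[s])
            for b in range(len(probs)):
                grad[s][b] += ((1.0 if a == b else 0.0) - probs[b]) * q
    n = len(trajectories)
    return [[g / n for g in row] for row in grad]

def reinforce_step(theta, grad, alpha):
    """Gradient ascent: theta <- theta + alpha * grad."""
    return [[th + alpha * g for th, g in zip(row_t, row_g)]
            for row_t, row_g in zip(theta, grad)]

# One state, two actions; action 1 was rewarded, action 0 was not.
theta = [[0.0, 0.0]]
trajectories = [[(0, 1, 1.0)], [(0, 0, 0.0)]]

grad = reinforce_gradient(trajectories, theta)
theta_new = reinforce_step(theta, grad, alpha=1.0)
prob_action_1 = softmax(theta_new[0])[1]
```

After one step the rewarded action becomes more likely and the unrewarded one less likely, which is the up-/down-weighting described above.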
Off-Policy Policy Gradient
updating the current policy using actions / transitions collected by a previous policy, using importance sampling
$$\begin{aligned}
\mathcal{J}(\theta') &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)} \left[ r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)} \, r(\tau) \right]
\end{aligned}$$
$$\begin{aligned}
\nabla_{\theta'} \mathcal{J}(\theta') &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)} \left[ \nabla_{\theta'} \log p_{\theta'}(\tau) \, r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)} \, \nabla_{\theta'} \log p_{\theta'}(\tau) \, r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \right) \left( \sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \right) r(\tau) \right] \\
&\approx \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \, \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \, \hat{Q}_t \right]
\end{aligned}$$
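The final approximation, with one importance ratio per time step, can be sketched for a tabular softmax policy. As before, the tabular logits, the `(s, a, r)` trajectory layout and the function names are illustrative assumptions; `theta_old` plays the role of $\theta$ (the data-collecting policy) and `theta_new` the role of $\theta'$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def off_policy_gradient(trajectories, theta_new, theta_old):
    """sum_t grad log pi_new(a_t|s_t) * pi_new(a_t|s_t)/pi_old(a_t|s_t) * Q_hat_t,
    averaged over trajectories that were sampled with theta_old."""
    grad = [[0.0] * len(row) for row in theta_new]
    for traj in trajectories:
        # Reward-to-go: Q_hat_t = sum_{t' >= t} r_t'
        q_hat, running = [], 0.0
        for (_s, _a, r) in reversed(traj):
            running += r
            q_hat.append(running)
        q_hat.reverse()
        for (s, a, _r), q in zip(traj, q_hat):
            p_new = softmax(theta_new[s])
            p_old = softmax(theta_old[s])
            ratio = p_new[a] / p_old[a]  # per-step importance weight
            for b in range(len(p_new)):
                grad[s][b] += ((1.0 if a == b else 0.0) - p_new[b]) * ratio * q
    n = len(trajectories)
    return [[g / n for g in row] for row in grad]

# Data collected by the old policy (uniform over two actions).
theta_old = [[0.0, 0.0]]
trajectories = [[(0, 1, 1.0)], [(0, 0, 0.0)]]

# Same parameters: every ratio is 1, so this is the on-policy estimate.
grad_same = off_policy_gradient(trajectories, theta_old, theta_old)

# New policy prefers action 1 with probability 0.75.
theta_new = [[0.0, math.log(3.0)]]
grad_new = off_policy_gradient(trajectories, theta_new, theta_old)
```

When the two policies coincide the estimator reduces to the on-policy gradient, and the ratio grows as the new policy assigns more probability to an action than the data-collecting policy did.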