original world models paper (link):
V: visual perception is encoded into a latent space using a VAE
M: transitions are modeled using an MDN-RNN (mixture density network: the RNN outputs a mixture of Gaussians over the next latent)
C: the agent (controller) is a linear projection of the hidden state vector of M and the latent vector of V
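The M module's mixture-of-Gaussians output can be sampled to hallucinate the next latent. A minimal numpy sketch (shapes and diagonal covariance are assumptions, not the paper's exact parameterization):

```python
import numpy as np

def sample_mdn(pi, mu, sigma, rng):
    """Draw one next-latent sample from an MDN's mixture-of-Gaussians output.

    pi:    (K,) mixture weights, summing to 1
    mu:    (K, D) component means
    sigma: (K, D) per-dimension std-devs (diagonal covariance assumed)
    """
    k = rng.choice(len(pi), p=pi)                        # pick a component
    return mu[k] + sigma[k] * rng.standard_normal(len(mu[k]))

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.3])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
sigma = np.full((2, 2), 0.1)
z_next = sample_mdn(pi, mu, sigma, rng)
```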
each module is trained separately; as a result, V and M weren't trained with the reward signal, sidestepping sparse updates and focusing on reconstruction and compression as the task
(intuition note: sounds like the SSL objective in vision, favoring rich generalizable representations over task-specific signals, e.g. the sparse feature cues of classification)
can the M module predict the reward for a frame?
next-token-prediction lessons to stabilize RL actions on OOD / cumulative-error trajectories
world model “absorbing” long-term skills, turning information into memory
dreaming: training the controller / agent in the RNN hallucination (autoregressive prediction of world evolution)
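The dreaming idea above can be sketched as a rollout that never touches the real environment: the controller acts, and the learned transition model hallucinates the next latent and a reward. Everything below (the linear toy dynamics, the reward shape) is an illustrative assumption, not the paper's modules:

```python
def dream_rollout(transition, controller, z0, horizon):
    """Roll out a trajectory entirely inside the learned model ("dreaming"):
    the controller picks actions, the model hallucinates the consequences."""
    z, total = z0, 0.0
    for _ in range(horizon):
        a = controller(z)
        z, r = transition(z, a)      # hallucinated next latent + predicted reward
        total += r
    return total

# toy stand-ins: latent drifts as z' = 0.9*z + a, reward penalizes |z'|
transition = lambda z, a: (0.9 * z + a, -abs(0.9 * z + a))
good = lambda z: -0.9 * z            # cancels the drift exactly
bad = lambda z: 0.0                  # does nothing

score_good = dream_rollout(transition, good, z0=1.0, horizon=10)
score_bad = dream_rollout(transition, bad, z0=1.0, horizon=10)
```

A controller trained this way (e.g. via evolution strategies over its parameters) is scored purely on hallucinated returns.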
Dino world model paper (link):
inverse model: (s_t, s_{t+1} → a_t)
MPC in latent space: given a starting frame and a goal frame, roll out candidate action trajectories, infer the intermediate states using dino-wm, and pick the trajectory / sequence of actions that minimizes the latent-space distance between the predicted final state and the goal state
model-based learning:
predicting in pixel space: computationally expensive, relies on reconstruction
predicting in latent space: cheaper, but task-specific features / signal aren't guaranteed by the representation objective; many methods incorporate reward prediction as an objective / aux task
video generative models as world models: predicting next frame conditioned on agent action
Dino-wm:
encoder / observation model: latent repr from images
transition model: next-frame latent repr from the last k frame latent reprs and the last k actions
decoder: image from latent representation
next frame tokens are predicted by attending to all tokens in previous frames, predicted in parallel
training:
given a sequence of observations and actions, train the model autoregressively (with causal attention) to predict the next observation from the last H observations conditioned on the last H actions
inference:
given the encoded initial and goal observations, use a search algorithm to find the best sequence of actions to reach the goal observation (scored by the distance between the predicted final observation latent and the goal latent)
take the first action from that sequence
repeat H times
MLP encoded action vectors and optional proprioceptive information are concatenated to each visual patch token to collectively form the latent state
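Forming the latent state by concatenation can be sketched as below; all shapes (196 DINO patch tokens of dim 384, a 10-d action embedding) are illustrative assumptions:

```python
import numpy as np

def build_latent_state(patch_tokens, action_emb, proprio=None):
    """Concatenate an (MLP-)encoded action vector, and optionally
    proprioceptive info, onto every visual patch token."""
    extras = [action_emb] if proprio is None else [action_emb, proprio]
    n = patch_tokens.shape[0]
    tiled = [np.tile(e, (n, 1)) for e in extras]   # broadcast to all patches
    return np.concatenate([patch_tokens] + tiled, axis=-1)

patches = np.zeros((196, 384))   # e.g. 14x14 grid of DINO patch tokens
act_emb = np.ones(10)            # hypothetical action embedding
state = build_latent_state(patches, act_emb)
```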
training objective
action trajectory search in latent space is done using CEM:
sampling distribution = gaussian
sample N random action sequences from sampling distribution
evaluate each trajectory’s objective
update the sampling distribution mean and std using the top k trajectories
repeat
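The CEM loop above can be sketched generically; here the cost function is a toy quadratic stand-in for the latent distance to the goal, and all hyperparameters are assumptions:

```python
import numpy as np

def cem_plan(cost_fn, horizon, act_dim, iters=15, pop=64, top_k=8, seed=0):
    """Cross-Entropy Method over action sequences.

    cost_fn maps an (horizon, act_dim) action sequence to a scalar cost,
    e.g. distance between predicted final latent and goal latent."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # sample N candidate trajectories from the current Gaussian
        cands = mean + std * rng.standard_normal((pop, horizon, act_dim))
        costs = np.array([cost_fn(c) for c in cands])
        elites = cands[np.argsort(costs)[:top_k]]      # keep the top-k
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# toy objective: the "goal" is an action sequence of all 0.5s
plan = cem_plan(lambda seq: float(((seq - 0.5) ** 2).sum()),
                horizon=4, act_dim=1)
```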
JEPA variants
I-JEPA (link)
invariance based methods (photometric/ physical augs, views, 2 encoders)
generative based methods (in-painting, MAE, …)
JEPA is highly efficient and scalable: it never processes all patch tokens in an image, processes a single view, and predicts in latent space
optional RCDM decoder to visualize the captured latent representation (view / aug invariant, features higher on the semantic scale)
V-JEPA 2 & V-JEPA 2-AC (link)
pre-trained a video model in an SSL fashion that transfers to downstream tasks: action cls, obj recognition, action anticipation, and VQA (aligning with an LLM)
the obtained model representation can be fed into a generative head conditioned on actions (V-JEPA 2-AC) for robot manipulation tasks
patchify input videos into 2x16x16 patches, apply 3D-RoPE
scaling recipe:
more data (@64x384x384)
larger ViT model
longer training w/ warmup-constant-decay
progressive resolution adaptation (16x256x256 → 64x384x384)
(trained using a warmup-constant-cooldown lr schedule with fixed ema & wd, for more cost-effective exploration)
evaluation:
Something-Something V2
fine-grained action recognition (e.g. [putting something into something], [pretending to put something into something], …)
MIT Diving 48
sports action recognition
20BN Jester
hand gestures recognition
GDM Kinetics
large-scale action recognition (100s of classes)
COIN
instructional video understanding, providing:
video-level task labels
step-level annotations
temporal boundaries (timestamps for each step)
ImageNet
V-JEPA 2-AC: tuning the predictor to be action conditioned
7D end-effector action vector: 3D Cartesian position, 3D orientation as extrinsic Euler angles, 1D gripper state
LeJEPA (link)
SSL via pure geometric alignment: pull augmented views toward their per-sample centroid while enforcing the batch distribution to be an isotropic Gaussian. No teachers, no predictors, no reconstruction
introduced the SIGReg regularizer to enforce an isotropic Gaussian distribution, which they prove to be the optimal representation distribution for downstream tasks
Causal-JEPA (link)
object-centric based JEPA using object slots instead of regular patches as tokens
uses an object-centric encoder (via slot attention) to represent scenes as variable sets of object tokens rather than fixed spatial patches
applies object-level masking, replacing object repr with identity anchor/ref + temporal emb
effectively forcing causal reasoning by eliminating intra-object / self-dynamics shortcut solutions and cues
VLA-0 paper (link):
using a VLM for action planning out of the box
discrete VLAs: quantize the action space and generate action instructions autoregressively
generative-head VLAs: use a custom head to decode the continuous latent vectors from the VLM (often diffusion, flow matching, …)
custom architecture: custom modules, embeddings, task conditioning, …
main recipe:
action decoding
ensemble prediction
masked action augmentation
model:
(system prompt, images, task instruction) → VLA0 → next H actions
used system prompt:
“Analyze the input image and predict robot actions for the next H timesteps. Each action has D dimensions. Output a single sequence of H×D integers (0 - B each), representing the H timesteps sequentially. Provide only space-separated numbers. Nothing else.”
action decoding:
each control module's action is output by the model as an integer (range bounded in the system prompt), resulting in a sequence of numbers instructing the action at each time step
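Decoding the model's text output back into continuous actions might look like this; the linear de-binning into a `[low, high]` control range is an assumption about the quantization scheme:

```python
def decode_actions(text, H, D, B, low, high):
    """Parse a "space-separated integers" output into H actions of dim D.

    Each integer in [0, B] is mapped linearly into [low, high]
    (hypothetical binning; the paper's exact mapping may differ)."""
    nums = [int(t) for t in text.split()]
    assert len(nums) == H * D, "model must emit exactly H*D integers"
    scale = (high - low) / B
    return [[low + n * scale for n in nums[i * D:(i + 1) * D]]
            for i in range(H)]

acts = decode_actions("0 50 100 50", H=2, D=2, B=100, low=-1.0, high=1.0)
# acts == [[-1.0, 0.0], [1.0, 0.0]]
```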
ensemble prediction:
at each time step t the model predicts the next H actions
resulting in multiple overlapping predictions for each action, which are aggregated
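One common aggregation for overlapping action chunks is a plain average of every still-valid prediction for the current step (the paper's exact weighting scheme may differ; this is a sketch):

```python
def ensemble_action(history, t):
    """Average all predictions that cover time step t.

    history: list of (t0, chunk) pairs, where chunk[i] is the action
    predicted at time t0 for step t0 + i."""
    votes = [chunk[t - t0] for t0, chunk in history
             if 0 <= t - t0 < len(chunk)]
    return sum(votes) / len(votes)

# chunks of H=3 scalar actions predicted at t = 0, 1, 2
hist = [(0, [1.0, 2.0, 3.0]), (1, [2.0, 3.0, 4.0]), (2, [1.0, 2.0, 3.0])]
a2 = ensemble_action(hist, t=2)   # averages 3.0, 3.0 and 1.0
```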
masked action augmentation:
randomly masking tokens in the ground truth sequence
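A minimal sketch of masking ground-truth action tokens; the mask rate and mask token are assumptions:

```python
import random

def mask_actions(tokens, mask_token="<mask>", p=0.15, seed=0):
    """Randomly replace ground-truth action tokens with a mask token,
    so the model can't shortcut by copying earlier targets."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < p else t for t in tokens]

masked = mask_actions(["12", "53", "99", "4", "87"], p=0.5)
```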
knowledge insulation:
forcing the model to rely on perception rather than memory / generic world knowledge gained during pre-training (e.g. outputting the “open door” action when the door is already open)
“free-running” training vs teacher forcing:
autoregressive training using causal attention, supervised with the ground-truth trajectory vs only the final state
eat: physically-grounded feature repr paper (link)
dinov3 checkpoint further trained (SSL fashion), augmenting representations to be physically grounded while preserving semantic richness
dataset: rendered objects as combinations of templates (objects) and materials, while preserving plausible real-world semantic appearance (marble can't be a folded sheet, …), under a variety of environments (lighting conditions, orientations, shadows, …)
Loss cocktail:
the same loss functions used to train Dinov3, along with an in-batch InfoNCE loss
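A standard in-batch InfoNCE formulation, sketched in numpy (this is the generic contrastive loss, not necessarily the paper's exact variant): each row of `z1` must match the same row of `z2`, with the other rows in the batch acting as negatives.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """In-batch InfoNCE loss; positives sit on the diagonal of the
    (N, N) cosine-similarity matrix."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
aligned = info_nce(z, z)              # matched pairs -> low loss
shuffled = info_nce(z, z[::-1].copy())  # mismatched pairs -> high loss
```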
Geometry-aware RoPE (link)
viewRoPE: spatial inconsistency / hallucination, i.e. failure to maintain 3D structure over long trajectories, is due to reliance on screen-space pos emb
they inject camera-ray directions into video ViT attention