Background: World Modeling

Overview

  • World modeling is the study of learning internal predictive representations of an environment so an agent can infer what is true now, what is likely to happen next, and what would happen under possible actions. A minimal world model can be written as a transition model over latent states:

    \[\hat{z}_{t+1}=f_\theta(z_t,a_t)\]
    • where \(z_t\) is a compact representation of the current observation, \(a_t\) is an action, and \(\hat{z}_{t+1}\) is the predicted next latent state. World Models by Ha and Schmidhuber (2018) established the modern neural framing of learning compressed spatial and temporal representations that can support policy learning inside a learned model.
  • A more operational definition starts from the agent-environment loop: an agent selects actions, actions change the world state, observations expose only a partial view of that state, and new observations inform future actions. Reinforcement Learning: An Introduction by Sutton and Barto (2018) formalizes this loop through Markov decision processes and partially observable Markov decision processes, making it the control-theoretic substrate for most world-model definitions. A Functional Taxonomy of World Models distinguishes three outputs of this loop: renderers output observations, simulators output states, and planners output actions.

  • The following figure (source) shows a functional taxonomy in which renderers produce observations, simulators model state, and planners select actions.

Renderer World Models

  • Renderer-style world models generate observations, typically pixels, videos, or interactive views. Their primary contract is visual fidelity: given a prompt, state estimate, camera motion, or user input, they synthesize what an observer would see. This includes text-to-video and interactive generation systems, where the model may create plausible visual sequences without maintaining a fully explicit physical state. A Functional Taxonomy of World Models frames video models and interactive visual systems as renderers because their output is observation-level appearance rather than directly computable state.

  • Renderer models are valuable for imagination, visualization, and human-facing interaction, but visual plausibility is not the same as physical validity. A generated environment can look coherent while lacking metric geometry, stable object identity, or physically meaningful collision behavior.

Simulator World Models

  • Simulator-style world models output state: geometry, materials, object layouts, dynamics, or other representations that downstream programs can compute on. Their primary contract is structural fidelity rather than only visual fidelity. A simulator must support inspection, interaction, counterfactual evaluation, and repeated rollouts under intervention.

  • This paradigm includes classical physics engines, digital twins, robotics simulators, and newer generative 3D world models. Marble: A Multimodal World Model describes a multimodal system that creates editable 3D worlds from text, image, video, or coarse 3D layouts and exports worlds as Gaussian splats, meshes, or videos, illustrating the renderer-simulator boundary.

  • Simulator world models are especially important for robotics, autonomous vehicles, engineering, game development, and scientific modeling because they provide a substrate for testing actions safely and cheaply before deployment.

Planner World Models

  • Planner-style world models output actions. Given an observation, a latent state, and a goal, a planner selects what should happen next:

    \[a_t^*=\arg\min_{a_t} C(z_t,a_t,z_g)\]
    • where \(C\) is a goal-conditioned cost and \(z_g\) is a target state. Planners may use a learned dynamics model, a value function, search, model predictive control, or a policy network. Dream to Control: Learning Behaviors by Latent Imagination by Hafner et al. (2019) is a canonical example of learning compact latent dynamics and training behavior through imagined rollouts.
  • Planner world models close the perception-action loop. They are most directly connected to embodied AI because their output is not an image or a scene description, but an intervention in the world.
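  • As a rough illustration of the arg min above, the following sketch picks the lowest-cost action from a small finite candidate set. The additive `toy_dynamics`, the squared-distance cost, and the `action_penalty` weight are assumptions made for this example only; a real planner would substitute a learned dynamics model and a task-specific cost.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def toy_dynamics(z: Vector, a: Vector) -> Vector:
    """Hypothetical stand-in for a learned model: the action shifts the latent."""
    return tuple(zi + ai for zi, ai in zip(z, a))


def cost(z: Vector, a: Vector, z_goal: Vector, action_penalty: float = 0.01) -> float:
    """C(z_t, a_t, z_g): squared distance from the predicted next latent to the
    goal, plus a small penalty on action magnitude (an illustrative choice)."""
    z_next = toy_dynamics(z, a)
    distance = sum((p - g) ** 2 for p, g in zip(z_next, z_goal))
    effort = sum(ai ** 2 for ai in a)
    return distance + action_penalty * effort


def plan_one_step(z: Vector, z_goal: Vector, candidates: Sequence[Vector]) -> Vector:
    """Return the candidate action with the lowest cost (arg min over a finite set)."""
    return min(candidates, key=lambda a: cost(z, a, z_goal))


# Nine candidate moves on a 2-D grid of latent displacements.
candidates = [(dx, dy) for dx in (-1.0, 0.0, 1.0) for dy in (-1.0, 0.0, 1.0)]
best_action = plan_one_step(z=(0.0, 0.0), z_goal=(1.0, -1.0), candidates=candidates)
print(best_action)  # (1.0, -1.0)
```

  • Enumerating candidates only works for small discrete action sets; continuous or long-horizon problems typically need sampling, gradient-based optimization, or a learned policy instead.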

The Simulation Bottleneck

  • Among renderer, simulator, and planner paradigms, simulation is often the bottleneck because it links visual appearance to action consequences. A renderer can synthesize observations, and a planner can choose actions, but a simulator represents the structural substrate from which both visual observations and action-conditioned futures can be derived. A Functional Taxonomy of World Models argues that simulation is the bridge between rendering and planning because geometry, physics, and dynamics are the underlying structures needed by both.

  • This makes simulator-quality representations central to spatial intelligence. The challenge is that explicit 3D, material, physical, and robot-interaction data are far scarcer than internet-scale images and video, and generated 3D assets can look plausible while still containing scale errors, self-intersections, or physically invalid structure.

Toward Unified World Models

  • The strongest long-term direction is a unified model that can render observations, simulate state, and plan actions using shared latent knowledge. In such a system, a cup on a table would not merely be a texture pattern in pixels; it would have geometry, pose, material properties, affordances, and action-conditioned consequences.

  • The following figure (source) shows the convergence toward unified world models that combine rendering, simulation, and planning. Specifically, it shows a unified world-model architecture in which rendering produces interpretable observations, simulation maintains and evolves world state, and planning selects actions by evaluating predicted futures.

  • This unified framing clarifies why world modeling is broader than video generation, robotics policy learning, or simulation alone. These are not isolated categories. They are projections of the same underlying problem: learning the structure of space, time, objects, dynamics, and agency.

JEPA as a Latent Predictive World Model

  • A central design question is whether the model should predict pixels, tokens, latent states, object slots, or task-relevant abstractions. Pixel-level generative models learn rich observation distributions, but they spend capacity on high-entropy details that may be irrelevant for planning, such as exact texture or background minutiae. JEPA-style models instead predict in representation space, biasing learning toward predictable semantic structure rather than full reconstruction. A Path Towards Autonomous Machine Intelligence by LeCun (2022) frames this as a path toward systems that learn predictive world models, reason, and plan through self-supervised learning rather than relying only on supervised labels or reinforcement rewards. (openreview.net)

  • Joint-Embedding Predictive Architectures, or JEPAs, are a family of self-supervised models that learn by predicting the embedding of one signal from another compatible signal. Instead of reconstructing \(y\) directly, a JEPA learns encoders and a predictor such that:

    \[s_x=f_\theta(x), \qquad s_y=f_{\bar{\theta}}(y), \qquad \hat{s}_y=g_\phi(s_x,z)\]
    • and optimizes a latent prediction loss such as:

      \[\mathcal{L}_{\text{JEPA}}=\left\|\hat{s}_y-s_y\right\|_2^2\]
  • The key shift is that compatibility is measured in embedding space rather than input space. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture by Assran et al. (2023) introduced I-JEPA, where a Vision Transformer context encoder predicts representations of masked target blocks using an EMA target encoder and a predictor network.
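  • To make the three mappings and the latent loss concrete, here is a minimal sketch that uses plain linear maps in place of real encoders. The matrices `W_context`, `W_target`, and `W_pred`, the concatenation of \(s_x\) with \(z\), and all dimensions are hypothetical choices for illustration; I-JEPA itself uses Vision Transformers and masked target blocks.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(W: Matrix, x: List[float]) -> List[float]:
    """Apply a linear map W to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def jepa_loss(x, y, z, W_context: Matrix, W_target: Matrix, W_pred: Matrix) -> float:
    """Squared L2 distance between predicted and target embeddings."""
    s_x = matvec(W_context, x)       # s_x = f_theta(x)
    s_y = matvec(W_target, y)        # s_y = f_theta_bar(y), used as a fixed target
    s_hat = matvec(W_pred, s_x + z)  # s_hat_y = g_phi(s_x, z) on the concatenation [s_x, z]
    return sum((p - t) ** 2 for p, t in zip(s_hat, s_y))


rng = random.Random(0)
dim_in, dim_s, dim_z = 4, 3, 2
W_context = random_matrix(dim_s, dim_in, rng)
W_target = [row[:] for row in W_context]  # target encoder starts as a copy
W_pred = random_matrix(dim_s, dim_s + dim_z, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(dim_in)]  # context signal
y = [rng.gauss(0.0, 1.0) for _ in range(dim_in)]  # compatible target signal
z = [0.0, 1.0]                                    # conditioning variable
loss = jepa_loss(x, y, z, W_context, W_target, W_pred)
print(loss)
```

  • Note that nothing in this loss compares \(x\) or \(y\) directly; the error is computed entirely between embeddings, which is the property the paragraph above describes.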

  • The following figure (source) shows common architectures for self-supervised learning, in which the system learns to capture the relationships between its inputs. The objective is to assign a high energy (large scalar value) to incompatible inputs, and to assign a low energy (low scalar value) to compatible inputs. (a) Joint-Embedding Architectures learn to output similar embeddings for compatible inputs \(x, y\) and dissimilar embeddings for incompatible inputs. (b) Generative Architectures learn to directly reconstruct a signal \(y\) from a compatible signal \(x\), using a decoder network that is conditioned on additional (possibly latent) variables \(z\) to facilitate reconstruction. (c) Joint-Embedding Predictive Architectures learn to predict the embeddings of a signal \(y\) from a compatible signal \(x\), using a predictor network that is conditioned on additional (possibly latent) variables \(z\) to facilitate prediction.

  • In world modeling, JEPA is best understood as a latent-space predictive model. Its appeal is that it can model the predictable consequences of perception and action without forcing the system to model every observation detail. In images, I-JEPA predicts masked spatial regions; in video, V-JEPA and V-JEPA 2 predict masked spatiotemporal regions; in robotics, action-conditioned variants predict future latent states conditioned on control inputs. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning by Assran et al. (2025) scales this recipe to internet-scale video and then post-trains an action-conditioned predictor with robot trajectories for planning.

  • The following figure (source) shows the V-JEPA 2 training and deployment pipeline from large-scale video pretraining to downstream understanding and planning tasks. Specifically, large-scale video pretraining produces a video encoder for understanding and prediction, and action-conditioned post-training turns the frozen representation space into a planning-capable latent world model. Leveraging 1M hours of internet-scale video and 1M images, V-JEPA 2 is pretrained as a video model using a visual mask denoising objective, and this model is leveraged for downstream tasks such as action classification, object recognition, action anticipation, and Video Question Answering by aligning the model with an LLM backbone. After pretraining, we can also freeze the video encoder and train a new action-conditioned predictor with a small amount of robot interaction data on top of the learned representations, and leverage this action-conditioned model, V-JEPA 2-AC, for downstream robot manipulation tasks using planning within a model predictive control loop.

  • A practical JEPA implementation usually contains four components. First, an encoder maps observations into latent tokens. Second, a target encoder, often an exponential moving average of the context encoder, provides stable targets. Third, a predictor maps context representations, target-position tokens, temporal tokens, or action embeddings into predicted target representations. Fourth, an anti-collapse mechanism prevents the trivial solution where all inputs map to the same embedding. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels by Maes et al. (2026) proposes an end-to-end JEPA from pixels with a next-embedding prediction loss plus a Gaussian latent regularizer. (arxiv.org)
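  • Two of the components above are easy to sketch in isolation: the exponential-moving-average update of the target encoder and a check for collapsed embeddings. The momentum value, the flat parameter list, and the standard-deviation threshold below are assumptions for illustration. The variance check is a generic diagnostic, not the regularizer used by any of the cited papers.

```python
from typing import List


def ema_update(target: List[float], online: List[float], momentum: float = 0.99) -> List[float]:
    """Move target-encoder parameters a small step toward the online encoder."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


def per_dim_std(embeddings: List[List[float]]) -> List[float]:
    """Standard deviation of each embedding dimension across a batch."""
    n, d = len(embeddings), len(embeddings[0])
    stds = []
    for j in range(d):
        column = [e[j] for e in embeddings]
        mean = sum(column) / n
        variance = sum((c - mean) ** 2 for c in column) / n
        stds.append(variance ** 0.5)
    return stds


def looks_collapsed(embeddings: List[List[float]], threshold: float = 1e-3) -> bool:
    """Flag a batch whose embeddings barely vary in any dimension."""
    return max(per_dim_std(embeddings)) < threshold


target_params = ema_update(target=[0.0, 0.0], online=[1.0, -1.0], momentum=0.9)
print(target_params)                                # roughly [0.1, -0.1]
print(looks_collapsed([[1.0, 2.0], [1.0, 2.0]]))    # True: every input maps to one point
print(looks_collapsed([[0.0, 0.0], [1.0, 0.0]]))    # False: dimension 0 still varies
```

  • A diagnostic like this only detects complete collapse; partial (dimensional) collapse, where some dimensions carry no information, would need a per-dimension or covariance-based check.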

  • The core implementation distinction between generative world models and JEPA world models is therefore the training target:

    \[\text{Generative: } \mathcal{L}=-\log p_\theta(o_{t+1}\mid o_{\le t},a_{\le t})\]
    \[\text{JEPA: } \mathcal{L}=\left\|g_\phi(f_\theta(o_{\le t}),a_t)-f_{\bar{\theta}}(o_{t+1})\right\|_2^2\]
  • The second objective avoids an observation likelihood and directly trains the model to predict latent structure useful for perception, dynamics, and control. V-JEPA 2 by Assran et al. (2025) reports that this representation-space prediction supports motion understanding, action anticipation, video question answering after language alignment, and robot manipulation through latent model-predictive control. (arxiv.org)
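  • A short worked step may help show where the two objectives actually differ. Assume, purely for illustration, an isotropic Gaussian observation model with decoder mean \(\mu_\theta\), fixed variance \(\sigma^2\), and observation dimension \(D\) (none of these symbols appear in the text above, and diffusion or autoregressive renderers use other likelihoods). Under that assumption the generative loss is also a squared error, but one measured in observation space:

```latex
% Assumption: p_\theta(o_{t+1} \mid o_{\le t}, a_{\le t})
%             = \mathcal{N}\big(\mu_\theta(o_{\le t}, a_{\le t}),\, \sigma^2 I\big)
\begin{aligned}
-\log p_\theta(o_{t+1}\mid o_{\le t},a_{\le t})
  &= \frac{1}{2\sigma^2}\left\| o_{t+1}-\mu_\theta(o_{\le t},a_{\le t}) \right\|_2^2
   + \frac{D}{2}\log\!\left(2\pi\sigma^2\right) \\
\mathcal{L}_{\text{JEPA}}
  &= \left\| g_\phi(f_\theta(o_{\le t}),a_t)-f_{\bar{\theta}}(o_{t+1}) \right\|_2^2
\end{aligned}
```

  • In the first line every pixel-level detail of \(o_{t+1}\) contributes to the error, whereas in the second the target \(f_{\bar{\theta}}(o_{t+1})\) is itself produced by an encoder, so the model can ignore unpredictable detail. The same freedom is what makes an anti-collapse mechanism necessary.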

  • JEPA has also expanded beyond images and video. A-JEPA: Joint-Embedding Predictive Architecture Can Listen by Fei et al. (2023) adapts JEPA to audio spectrograms with curriculum masking from random blocks to time-frequency-aware masks. (arxiv.org) DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture by He et al. (2025) orders target-region prediction using attention-derived saliency, turning flat latent prediction into a sequential curriculum. (arxiv.org) Causal-JEPA: Learning World Models through Object-Level Latent Interventions by Nam et al. (2026) moves masking from patch-level features to object-centric slots, making interaction reasoning necessary by requiring masked object states to be inferred from other objects. (arxiv.org)

  • For the rest of the primer, the natural progression is: foundations of world models, functional world-model paradigms, JEPA mechanics, I-JEPA, video JEPA and V-JEPA 2, action-conditioned planning, object-centric and causal JEPA, probabilistic JEPA variants, collapse prevention, and implementation recipes.

Foundations of World Modeling

  • World modeling rests on the premise that intelligence requires an internal model capable of predicting the consequences of observations and actions. This internal model must encode sufficient information about the environment to support perception, reasoning, and planning, while remaining compact and computationally tractable.

The Agent-World Loop

  • A world model is most naturally situated inside an agent-world loop. The world has a latent state, the agent receives partial observations of that state, the agent chooses actions, and the world transitions to a new state. In this framing, world modeling is not only about generating pixels; it is about learning the structure that connects state, observation, and action.
\[s_t \rightarrow o_t \rightarrow a_t \rightarrow s_{t+1}\]
  • Reinforcement Learning: An Introduction by Sutton and Barto (2018) formalizes the agent-environment loop through Markov decision processes and partially observable Markov decision processes, providing the mathematical substrate for action-conditioned world modeling.

  • A Functional Taxonomy of World Models frames this loop functionally: renderers map state or actions to observations, simulators model state transitions, and planners map observations or latent state estimates to actions. This distinction is useful because it separates world models by the kind of output they are designed to produce: observations, states, or actions.

Formal Definition of a World Model

  • A world model is typically defined as a latent dynamical system:

    \[z_t = f_\theta(o_{\le t}), \qquad z_{t+1} \sim p_\theta(z_{t+1} \mid z_t, a_t)\]
    • where \(o_t\) denotes observations, \(z_t\) is a latent representation, and \(a_t\) is an action. The model may also include a decoder \(\hat{o}_t \sim p_\theta(o_t \mid z_t)\) depending on whether reconstruction is required.
  • The functional taxonomy refines this definition by asking what the model is supposed to output. A renderer primarily estimates \(p(o_{t+1}\mid z_t,a_t)\), a simulator estimates \(p(z_{t+1}\mid z_t,a_t)\), and a planner estimates or optimizes \(a_t\) given observations, goals, and predicted futures. A Functional Taxonomy of World Models uses this output-based distinction to clarify why many systems called “world models” are solving related but different problems.

  • This distinction also clarifies the different training contracts. Renderer world models optimize observation fidelity, often through diffusion or autoregressive sequence modeling. Simulator world models optimize state validity, such as geometric consistency, physical dynamics, or latent transition accuracy. Planner world models optimize decision quality, often through search, model predictive control, value learning, or policy learning inside imagined trajectories.

  • In control settings, the model is often embedded within a planning objective:

    \[a_{t:t+H}^* = \arg\max_{a_{t:t+H}} \mathbb{E} \left[ \sum_{k=0}^{H} r(z_{t+k}, a_{t+k}) \right]\]
    • where planning occurs by simulating trajectories in latent space. This formulation highlights that the quality of \(z_t\) directly determines planning performance.
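  • One simple way to approximate this arg max is random shooting: sample candidate action sequences, roll each one out in latent space, and keep the best. The 1-D `dynamics`, the quadratic `reward`, the goal value, and the action bounds below are all hypothetical. Because the toy model is deterministic, the expectation reduces to a single rollout per candidate.

```python
import random
from typing import List, Tuple


def dynamics(z: float, a: float) -> float:
    """Hypothetical 1-D latent transition: the action displaces the latent."""
    return z + a


def reward(z: float, a: float, goal: float = 5.0) -> float:
    """r(z, a): negative squared distance to a goal, minus a small action cost."""
    return -((z - goal) ** 2) - 0.01 * a ** 2


def rollout_return(z0: float, actions: List[float]) -> float:
    """Sum of r(z_{t+k}, a_{t+k}) along a simulated latent trajectory."""
    z, total = z0, 0.0
    for a in actions:
        total += reward(z, a)
        z = dynamics(z, a)
    return total


def random_shooting(
    z0: float, n_steps: int, n_candidates: int, rng: random.Random
) -> Tuple[List[float], float]:
    """Return the best sampled action sequence and its simulated return."""
    best_actions: List[float] = []
    best_return = float("-inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(n_steps)]
        ret = rollout_return(z0, actions)
        if ret > best_return:
            best_actions, best_return = actions, ret
    return best_actions, best_return


plan, value = random_shooting(z0=0.0, n_steps=6, n_candidates=256, rng=random.Random(0))
first_action = plan[0]  # in model predictive control, only this action is executed
print(first_action, value)
```

  • In a model predictive control loop the agent would execute `first_action`, observe the result, re-encode it into a new latent state, and replan, which limits how far model errors can compound.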

Renderer Paradigm

  • Renderer world models output observations. They are trained or prompted to produce pixels, videos, views, or sensory predictions. Their central question is: what would the world look like from this condition?

  • A renderer can be written as:

    \[\hat{o}_{t+1} \sim p_\theta(o_{t+1}\mid z_t,a_t,c)\]
    • where \(c\) may include a text prompt, camera pose, previous frame, latent scene code, action sequence, or interaction command.
  • This paradigm includes image diffusion, video diffusion, interactive video systems, neural rendering, text-to-3D-to-video systems, and action-conditioned environment renderers. High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022) shows how latent diffusion makes high-resolution observation generation computationally practical, while Video Diffusion Models by Ho et al. (2022) extends diffusion to temporally coherent video generation. Renderer models are often evaluated by visual fidelity, temporal coherence, controllability, prompt adherence, action adherence, and long-horizon stability.

  • Interactive renderers make the renderer paradigm look more world-model-like because they condition generation on actions. GAIA-1: A Generative World Model for Autonomous Driving by Hu et al. (2023) combines video, text, and action tokens to generate controllable driving futures, and Genie: Generative Interactive Environments by Bruce et al. (2024) learns latent actions from unlabeled video to create playable generated environments. These models produce action-conditioned observations, but their output remains rendered sensory experience rather than an explicit, inspectable physical state.

  • Renderer models are powerful but incomplete as world models. A visually plausible rollout may not preserve object permanence, metric geometry, material consistency, or physically valid dynamics. This matters because planning and control require not just what looks plausible, but what is causally and physically possible.

Simulator Paradigm

  • Simulator world models output state. They represent the structure of the environment in a form that can be queried, edited, rolled forward, or used for downstream computation. Their central question is: what is the world state, and how does it change?

  • A simulator can be written as:

    \[z_{t+1} \sim p_\theta(z_{t+1}\mid z_t,a_t)\]
    • where \(z_t\) may encode geometry, object pose, materials, contact state, dynamics, mesh structure, radiance fields, Gaussian splats, particle states, graph relations, or symbolic variables.
  • Simulator world models include physics engines, digital twins, robotics simulators, neural scene representations, 3D generative worlds, graph-based physical simulators, mesh-based simulators, and latent dynamics models. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis by Mildenhall et al. (2020) represents scenes as continuous radiance fields that can be queried from novel viewpoints, while 3D Gaussian Splatting for Real-Time Radiance Field Rendering by Kerbl et al. (2023) represents scenes with explicit Gaussian primitives that support real-time rendering. Marble: A Multimodal World Model describes a system for creating editable 3D worlds from text, images, video, or 3D layouts and exporting them as Gaussian splats, meshes, or videos, illustrating how generative world systems can move from pure rendering toward editable state.

  • Simulation is the bridge between visual generation and embodied action. A renderer may produce a believable frame, but a simulator must preserve the underlying state so that actions have stable, repeatable consequences. Learned physical simulators make this explicit: Interaction Networks for Learning about Objects, Relations and Physics by Battaglia et al. (2016) models object-relation dynamics, Learning to Simulate Complex Physics with Graph Networks by Sanchez-Gonzalez et al. (2020) learns particle-based dynamics through message passing, and Learning Mesh-Based Simulation with Graph Networks by Pfaff et al. (2020) learns mesh-based simulation for scientific and engineering domains.

  • This is why simulator world models are central for robotics, autonomous driving, AR/VR, engineering design, and scientific experimentation. They support counterfactual queries, editable state, physical rollouts, and integration with planners.

Planner Paradigm

  • Planner world models output actions. They use observations, latent states, goals, costs, rewards, value estimates, and predicted futures to decide what an agent should do.

  • A planner can be written as:

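    \[a_t^* = \arg\min_{a_t} C(z_t,a_t,z_g)\]
    • where \(C\) is a goal-conditioned cost and \(z_g\) is a target state, matching the planner objective introduced in the overview.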

Desiderata for Effective World Models

  • An effective world model must satisfy three core properties:

    • Predictive sufficiency with task relevance: the latent state must contain the information necessary to predict future observations, states, rewards, values, or action consequences, while preserving the variables that matter for the model’s functional role.
    • Compactness with controllable detail: the representation should discard irrelevant variability, but it should not discard small visual, geometric, or physical details that are decision-critical.
    • Compositional structure with intervention support: the representation should factorize into entities, relations, geometry, dynamics, or task variables so that the system can support reasoning, editing, counterfactuals, and action-conditioned rollouts.
  • These requirements are often in tension. For example, maximizing predictive accuracy can encourage encoding irrelevant details, while excessive compression can remove necessary information.

  • The functional taxonomy adds another requirement: output alignment. A renderer should optimize observation quality, a simulator should optimize state validity, and a planner should optimize action quality. A system can perform well under one contract while failing under another, which is why world-model evaluation must match the intended functional role.

Representation Learning in World Models

  • The central challenge is learning a representation \(z_t\) that balances invariance and equivariance:

    • Invariance for semantic abstraction removes nuisance variability such as lighting, texture, or background clutter when those details are not relevant to prediction or control.
    • Equivariance for structured prediction preserves transformation-sensitive structure such as object motion, viewpoint change, geometry, pose, and action-conditioned state transitions.
  • This trade-off is fundamental. seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models by Ghaemi et al. (2026) shows that standard self-supervised learning methods struggle to simultaneously capture both properties, motivating architectures that explicitly separate invariant and equivariant representations.

  • For simulator-style world models, equivariance is especially important because the model must preserve how state changes under viewpoint shifts, object motion, and action interventions. For renderer-style models, invariance may improve semantic control, but excessive invariance can remove the geometric details needed for coherent view synthesis. For planner-style models, the representation must be invariant to irrelevant variation while remaining sensitive to any feature that changes reward, value, safety, or feasibility.
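  • The difference between the two properties can be shown on a toy 1-D signal. In the sketch below, a circular shift stands in for a transformation such as object motion; the summed intensity is an invariant feature and the peak position is an equivariant one. Both features and the signal are illustrative assumptions, not representations learned by any cited model.

```python
from typing import List


def shift(signal: List[int], k: int) -> List[int]:
    """Circularly shift a 1-D signal k positions to the right."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def invariant_feature(signal: List[int]) -> int:
    """Total intensity: unchanged by any shift (a shift-invariant summary)."""
    return sum(signal)


def equivariant_feature(signal: List[int]) -> int:
    """Index of the peak: moves by exactly the shift (a shift-equivariant summary)."""
    return max(range(len(signal)), key=lambda i: signal[i])


signal = [0, 0, 5, 1, 0, 0]
k = 2
moved = shift(signal, k)

print(invariant_feature(signal), invariant_feature(moved))      # 6 6
print(equivariant_feature(signal), equivariant_feature(moved))  # 2 4
```

  • A planner that needs to know where the object is must keep the second kind of feature; a classifier that only needs to know what is present can rely on the first.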

Temporal Abstraction and Dynamics

  • World models must capture temporal dependencies across multiple scales:

    • Short-term dynamics with local consequences: immediate motion, contact, collision, object persistence, next-frame prediction, and short-horizon control effects.
    • Long-term structure with goal relevance: environment rules, task progress, value estimates, agent intent, constraints, and delayed consequences.
  • Latent state-space models typically factorize dynamics as:

    \[z_{t+1} = f_\theta(z_t, a_t, \epsilon_t)\]
    • where \(\epsilon_t\) introduces stochasticity. This is essential because real-world environments are partially observable and inherently uncertain.
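  • A small sketch of the stochastic transition above: sampling many rollouts from the same start state and action sequence gives a spread of final latents rather than a single point. The linear form of `stochastic_step`, the decay factor 0.9, and the Gaussian noise scale are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple


def stochastic_step(z: float, a: float, rng: random.Random, noise_std: float) -> float:
    """z_{t+1} = f(z_t, a_t, eps_t) with a hypothetical linear f and Gaussian eps_t."""
    eps = rng.gauss(0.0, noise_std)
    return 0.9 * z + a + eps


def sample_final_states(
    z0: float, actions: List[float], n_samples: int, noise_std: float = 0.1, seed: int = 0
) -> List[float]:
    """Roll the same action sequence forward n_samples times with fresh noise."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_samples):
        z = z0
        for a in actions:
            z = stochastic_step(z, a, rng, noise_std)
        finals.append(z)
    return finals


def mean_and_std(values: List[float]) -> Tuple[float, float]:
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


finals = sample_final_states(z0=1.0, actions=[0.5] * 5, n_samples=200)
mean, std = mean_and_std(finals)
print(mean, std)  # the spread reflects accumulated transition noise
```

  • Setting `noise_std=0.0` recovers the deterministic transition model from the overview, which is a convenient sanity check when debugging a stochastic model.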
  • Temporal abstraction differs across the three functional paradigms. Renderers must maintain visual identity and scene consistency across frames. Simulators must preserve state variables over rollouts so that actions have stable consequences. Planners must reason across horizons, often combining short-horizon model predictions with long-horizon value functions to avoid compounding error.

  • For instance, LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels by Maes et al. (2026) presents a JEPA-based latent dynamics pipeline in which an encoder produces latent states and a predictor models transitions across time.

  • In such systems, the encoder compresses observations into \(z_t\), and the predictor learns a transition function over latent space.

Uncertainty and Partial Observability

  • Real-world environments are not fully deterministic. A world model must represent uncertainty over future states:

    \[p(z_{t+1} \mid z_t, a_t)\]
    • rather than a single deterministic prediction.
  • Probabilistic formulations extend latent dynamics models with belief states:

\[b_t = p(z_t \mid o_{\le t}, a_{<t})\]
  • This connects world modeling to partially observable Markov decision processes (POMDPs), where the agent maintains a distribution over latent states.
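  • For a small discrete latent space, the belief state can be maintained with a standard Bayes filter: predict through the transition model, then reweight by the observation likelihood. The two-state example, the transition matrix, and the likelihood values below are made up for illustration.

```python
from typing import List


def predict(belief: List[float], transition: List[List[float]]) -> List[float]:
    """Push the belief through p(z' | z, a); transition[i][j] = p(z'=j | z=i, a)."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def correct(belief: List[float], likelihood: List[float]) -> List[float]:
    """Reweight by p(o | z') for the received observation and renormalize."""
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def belief_update(belief, transition, likelihood) -> List[float]:
    """One filtering step: b_{t-1} -> b_t given action a_{t-1} and observation o_t."""
    return correct(predict(belief, transition), likelihood)


prior = [0.5, 0.5]                        # uniform belief over two latent states
transition = [[0.9, 0.1], [0.2, 0.8]]     # hypothetical dynamics for the chosen action
likelihood = [0.7, 0.1]                   # p(o_t | z_t) for the observation received
posterior = belief_update(prior, transition, likelihood)
print(posterior)  # most of the mass moves to state 0
```

  • Continuous or high-dimensional latents make this exact update intractable, which is why learned world models typically approximate the belief with a recurrent state or a parametric distribution.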

  • Uncertainty takes different forms across the taxonomy. Renderer uncertainty appears as multiple plausible observations or videos. Simulator uncertainty appears as multiple plausible physical states or transitions. Planner uncertainty appears as risk over action outcomes, value estimates, and model exploitation. PETS by Chua et al. (2018) makes this explicit for planning by propagating uncertainty through sampled model rollouts, while Variational JEPA: Probabilistic World Models by Huang (2026) extends JEPA into a probabilistic framework, learning a predictive distribution over latent states rather than point estimates, enabling uncertainty-aware planning and filtering.

Object-Centric and Relational Structure

  • A key limitation of monolithic latent representations is their inability to explicitly represent interactions between entities. Object-centric world models address this by decomposing the latent state into a set of object representations:
\[z_t = \{z_t^{(1)}, z_t^{(2)}, \dots, z_t^{(N)}\}\]
  • This structure allows modeling interactions such as collisions, occlusions, contact, containment, support, force transfer, and causal relationships.

  • Object-centric and relational structure appears in several simulator families. Interaction Networks by Battaglia et al. (2016) uses object and relation graphs to learn physical dynamics, while Visual Interaction Networks by Watters et al. (2017) infers object-centric latent states from video and rolls them forward with relational dynamics.

  • For instance, Causal-JEPA: Learning World Models through Object-Level Latent Interventions by Nam et al. (2026) uses object-level masking over object-centric latent slots so the model must infer a masked object’s state from the surrounding objects. This makes relational interaction and counterfactual-like reasoning necessary rather than optional, and it introduces a causal inductive bias into representation learning.
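  • The idea of object-level masking can be sketched with a set of slot vectors: hide one slot, predict it from the others, and score the prediction only on the hidden slot. The mean-of-visible-slots predictor and the zero mask token are deliberately naive placeholders, not the architecture of Causal-JEPA or any other cited model.

```python
from typing import Callable, List

Slot = List[float]


def mask_slot(slots: List[Slot], index: int, mask_token: Slot) -> List[Slot]:
    """Replace one object slot with a mask token, leaving the others visible."""
    return [mask_token if i == index else s for i, s in enumerate(slots)]


def mean_of_visible(masked_slots: List[Slot], index: int) -> Slot:
    """Placeholder predictor: guess the hidden slot as the mean of visible slots."""
    visible = [s for i, s in enumerate(masked_slots) if i != index]
    dim = len(visible[0])
    return [sum(s[j] for s in visible) / len(visible) for j in range(dim)]


def masked_slot_loss(
    slots: List[Slot],
    index: int,
    predictor: Callable[[List[Slot], int], Slot],
    mask_token: Slot,
) -> float:
    """Squared error between the predicted and true state of the masked object."""
    prediction = predictor(mask_slot(slots, index, mask_token), index)
    return sum((p - t) ** 2 for p, t in zip(prediction, slots[index]))


slots = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]]  # three objects, 2-D slot each
mask_token = [0.0, 0.0]

print(masked_slot_loss(slots, 2, mean_of_visible, mask_token))  # 0.0
print(masked_slot_loss(slots, 0, mean_of_visible, mask_token))  # 4.5
```

  • Because the loss is defined on a whole object rather than on a patch, a predictor can only reduce it by using information about the other objects, which is the pressure toward relational reasoning described above.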

Learning Paradigms for World Models

  • World models can be learned through three primary paradigms:

    • Supervised learning with explicit state or transition labels uses labeled geometry, physics state, simulator traces, object attributes, or action-conditioned transitions when such annotations are available.
    • Reinforcement learning through interaction and reward signals learns by collecting experience, improving policies, and updating models or value functions from task feedback.
    • Self-supervised learning from raw observations learns predictive structure from images, videos, audio, proprioception, or multimodal streams without requiring explicit labels.
  • Self-supervised learning is particularly attractive because it scales with unlabeled data. JEPA belongs to this category, learning predictive representations without requiring reconstruction or rewards.
  • Renderer, simulator, and planner systems can all be trained with self-supervised learning, but they differ in the supervision signal. Renderers often learn from observation prediction or denoising. Simulators benefit from state, geometry, multi-view, physical, or rollout consistency. Planners require some form of action-quality signal, such as rewards, values, preferences, demonstrations, task embeddings, or goal-conditioned costs.

Limitations of Existing Approaches

  • Despite significant progress, current world models face several limitations:

    • Overfitting to observation details in generative models can make renderers visually impressive while wasting capacity on high-entropy details that are irrelevant for state prediction or action selection.
    • Representation collapse in latent predictive models can produce compact embeddings that minimize a prediction objective while discarding information needed for downstream reasoning.
    • Poor generalization across tasks and environments can occur when models learn domain-specific shortcuts rather than reusable structure.
    • Limited causal reasoning in patch-based representations can make models sensitive to local correlations while failing to represent entities, interventions, and interaction structure.
  • Functional limitations are equally important:

    • Renderer limitation: visual plausibility does not guarantee physical validity, long-horizon persistence, action adherence, or inspectable state.
    • Simulator limitation: editable, computable, physically valid state is difficult to learn at internet scale, especially when explicit 3D or physical supervision is scarce.
    • Planner limitation: good actions require reliable predictive state, calibrated uncertainty, robust value estimation, and objective specification that prevents model exploitation.
  • These challenges motivate the design of architectures that enforce structure in representation space, incorporate inductive biases for interaction, and scale with large datasets.

Transition to the Main World-Model Paradigms

  • Renderer models, simulator models, planner models, and JEPA each address different limitations described above. Renderer models address the need for human-interpretable imagination by producing observations that can be inspected, edited, and used for synthetic experience. Simulator models mitigate the weakness of purely visual generation by representing state in a form that can be queried, rolled forward, and tested under counterfactual interventions. Planner models mitigate the gap between prediction and agency by using learned futures to select actions, optimize policies, and evaluate consequences. JEPA mitigates the inefficiency of reconstruction-heavy modeling by directly optimizing predictive structure in latent space.

  • From the functional-taxonomy perspective, JEPA is closest to the simulator paradigm when it learns latent state transitions, and it becomes a planner substrate when the learned latent dynamics are paired with model predictive control, tree search, value estimation, or goal-conditioned optimization. Its main distinction from renderer-first models is that it does not require observation reconstruction as the primary training objective.

  • The next sections examine these paradigms in detail: renderer models, simulator models, planner models, and then JEPA, with emphasis on architectures, training objectives, implementation principles, and failure modes.

Renderer World Models

Renderer World Models as Generative Observation Models

  • Renderer world models are models whose primary output is an observation: an image, video, frame sequence, view, or interactive visual stream. In the functional taxonomy, they model the conditional distribution of future observations given a conditioning signal:

    \[\hat{o}_{t:t+H} \sim p_\theta(o_{t:t+H}\mid c)\]
    • where \(c\) may include text, an image, a video prefix, a camera path, actions, or structured conditioning. Unlike simulator-first models, renderers do not necessarily expose an explicit physical state. Their central contract is observation fidelity: the generated world should look coherent, controllable, temporally stable, and consistent with the conditioning signal.

Image Diffusion as the Foundation of Renderer World Models

  • Modern renderer world models are largely built on diffusion modeling. A diffusion model learns to reverse a noising process:

    \[q(x_t \mid x_0)=\mathcal{N}(\alpha_t x_0,\sigma_t^2 I)\]
    • and trains a denoising network to estimate the noise or clean signal:

      \[\mathcal{L}_{\text{diff}}= \mathbb{E}_{x,\epsilon,t} \left[ \left\| \epsilon-\epsilon_\theta(x_t,t,c) \right\|_2^2 \right]\]
  • High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022) made diffusion practical at high resolution by moving denoising from pixel space into a learned latent space, preserving perceptual detail while reducing training and inference cost.

  • The following figure (source) shows how latent diffusion preserves reconstruction quality with milder spatial downsampling than earlier latent generative models. Specifically, it illustrates how less aggressive downsampling raises the upper bound on achievable quality. Because diffusion models offer excellent inductive biases for spatial data, they do not need the heavy spatial downsampling of related latent-space generative models, yet suitable autoencoding models can still greatly reduce the dimensionality of the data. Images are from the DIV2K validation set, evaluated at \(512^2 \mathrm{px}\). The spatial downsampling factor is denoted by \(f\). Reconstruction FIDs and PSNR are calculated on ImageNet-val.

  • The following figure (source) illustrates that LDMs are conditioned either via concatenation or by a more general cross-attention mechanism.

  • The key implementation pattern is two-stage training:

    \[z=E(x), \qquad \hat{x}=D(z)\] \[\mathcal{L}_{\text{LDM}}= \mathbb{E}_{E(x),\epsilon,t} \left[ \left\| \epsilon-\epsilon_\theta(z_t,t,c) \right\|_2^2 \right]\]
    • where \(E\) and \(D\) are an autoencoder pair, and \(\epsilon_\theta\) is the denoising model trained over latent variables rather than pixels. High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022) also introduced cross-attention conditioning for text, semantic maps, and other structured inputs, making latent diffusion a general-purpose renderer architecture.
  • The following figure (source) shows perceptual and semantic compression, where the autoencoder removes imperceptible details before the diffusion model learns the semantic generative distribution. Most bits of a digital image correspond to imperceptible details. Although diffusion models can suppress this semantically meaningless information by minimizing the responsible loss term, gradients (during training) and the neural network backbone (during training and inference) still need to be evaluated on all pixels, leading to superfluous computation and unnecessarily expensive optimization and inference. Rombach et al. therefore propose latent diffusion models (LDMs) as an effective generative model paired with a separate mild compression stage that only eliminates imperceptible details.

Conditioning and Guidance

  • Renderer world models become useful when they are controllable. Conditioning can enter through concatenation, cross-attention, adaptive normalization, or in-context tokens. For text-to-image and text-to-video systems, cross-attention is especially important because it binds visual tokens to language tokens.

  • Classifier-free guidance is a common sampling-time method for strengthening conditioning:

    \[\tilde{\epsilon}_\theta(x_t,c) =\epsilon_\theta(x_t,\varnothing) + s\left(\epsilon_\theta(x_t,c)-\epsilon_\theta(x_t,\varnothing)\right)\]
    • where \(s\) is the guidance scale. Video Diffusion Models by Ho et al. (2022) uses classifier-free guidance in video generation and extends the image diffusion recipe to temporally coherent frame blocks.
  • Guidance improves adherence but can reduce diversity. This is a core renderer trade-off: stronger conditioning makes generated observations more faithful to a prompt or action, but may narrow the distribution of plausible worlds.
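  • The guidance rule above can be sketched in a few lines. This is a minimal illustration, assuming the two noise predictions have already been produced by two forward passes of the same denoiser (one with the condition, one with it dropped); the function name `classifier_free_guidance` and the toy arrays are illustrative, not part of any library.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale = 0 ignores the condition, scale = 1 recovers the conditional
    prediction, and scale > 1 extrapolates past it.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy predictions standing in for two forward passes of one denoiser.
eps_u = np.array([0.0, 1.0])
eps_c = np.array([1.0, 1.0])
guided = classifier_free_guidance(eps_u, eps_c, scale=3.0)
print(guided)
```

  • Only the coordinate where the conditional and unconditional predictions disagree is amplified, which is one way to see why large scales sharpen adherence while narrowing diversity.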

Video Diffusion as Temporal Rendering

  • Video renderer models extend image diffusion from a single observation to a block of observations:
\[\hat{o}_{1:T}\sim p_\theta(o_{1:T}\mid c)\]
  • The simplest version denoises an entire spatiotemporal block. Video Diffusion Models by Ho et al. (2022) extends image U-Nets into factorized space-time 3D U-Nets, adding temporal attention after spatial attention so the model can jointly represent appearance and motion.
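  • The factorization can be illustrated with tensor reshapes alone. The sketch below is not the layer from the paper: it uses a parameter-free attention (no learned query, key, or value projections) purely to show which axis acts as the token axis in each pass. The names `self_attention` and `factorized_space_time_attention` are hypothetical.

```python
import numpy as np

def self_attention(x):
    """Parameter-free single-head attention over the token axis of (batch, tokens, channels)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_space_time_attention(video):
    """video: (B, T, H, W, C). Spatial attention per frame, then temporal attention per location."""
    B, T, H, W, C = video.shape
    # Spatial pass: every frame is an independent sequence of H*W tokens.
    x = video.reshape(B * T, H * W, C)
    x = self_attention(x).reshape(B, T, H, W, C)
    # Temporal pass: every spatial location is an independent sequence of T tokens.
    x = x.transpose(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
    x = self_attention(x).reshape(B, H, W, T, C)
    return x.transpose(0, 3, 1, 2, 4)  # back to (B, T, H, W, C)

rng = np.random.default_rng(0)
video = rng.normal(size=(2, 4, 3, 3, 8))
out = factorized_space_time_attention(video)
print(out.shape)
```

  • Each pass attends over either H·W or T tokens rather than T·H·W at once, which is the cost argument for factorizing instead of using full spatiotemporal attention.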

  • For long videos, short-horizon generation must be extended. Video Diffusion Models by Ho et al. (2022) introduces reconstruction-guided conditional sampling for temporal extension and spatial super-resolution, showing how fixed-window diffusion models can generate longer and higher-resolution sequences.

  • A renderer video model therefore has two coupled requirements:

    • Spatial fidelity: each frame must look plausible and detailed.
    • Temporal fidelity: identities, motion, geometry, and scene layout must remain coherent across time.
  • The first requirement is inherited from image generation; the second is what makes video generation world-model-like.

Transformer Backbones for Renderer Scaling

  • As renderer models scale, transformer backbones become increasingly important. Scalable Diffusion Models with Transformers by Peebles and Xie (2023) replaces the U-Net backbone with a Diffusion Transformer (DiT) that operates over latent patches and finds that increasing forward-pass compute through depth, width, or token count consistently improves sample quality.

  • The following figure (source) shows the Diffusion Transformer architecture, where a noised latent is patchified into tokens and processed by transformer blocks with conditioning through adaptive layer normalization, cross-attention, or in-context tokens. Left: the authors train conditional latent DiT models, in which the input latent is decomposed into patches and processed by several DiT blocks. Right: details of the DiT blocks. The authors experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention, and extra input tokens, and report that adaptive layer norm works best.

  • A DiT-style renderer follows the same latent diffusion objective, but changes the denoiser:
\[z_t \rightarrow \text{Patchify}(z_t) \rightarrow \text{Transformer}_\theta(\cdot,t,c) \rightarrow \hat{\epsilon}\]
  • This architectural shift matters because renderer world models increasingly need the same scaling properties that made transformers successful in language and representation learning.
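  • The patchify step in the pipeline above is a pure reshape. The following sketch assumes a channels-last latent of shape (H, W, C) and a square patch size that divides both spatial dimensions; `patchify` and `unpatchify` are illustrative helper names, and the learned linear embedding that normally follows is omitted.

```python
import numpy as np

def patchify(latent, patch):
    """(H, W, C) latent -> (N, patch*patch*C) tokens, with N = (H/patch) * (W/patch)."""
    H, W, C = latent.shape
    assert H % patch == 0 and W % patch == 0, "patch size must divide H and W"
    h, w = H // patch, W // patch
    x = latent.reshape(h, patch, w, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)             # (h, w, patch, patch, C)
    return x.reshape(h * w, patch * patch * C)

def unpatchify(tokens, patch, H, W, C):
    """Inverse of patchify: (N, patch*patch*C) tokens -> (H, W, C) latent."""
    h, w = H // patch, W // patch
    x = tokens.reshape(h, w, patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)             # (h, patch, w, patch, C)
    return x.reshape(H, W, C)

latent = np.arange(32 * 32 * 4, dtype=float).reshape(32, 32, 4)
tokens = patchify(latent, patch=2)
restored = unpatchify(tokens, 2, 32, 32, 4)
print(tokens.shape)
```

  • Halving the patch size quadruples the token count (here 64 tokens at patch size 4 versus 256 at patch size 2), which is one of the levers for increasing forward-pass compute.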

Renderer World Models versus Latent Simulator Models

  • Renderer world models optimize observation likelihood or denoising quality. A renderer objective can be summarized as:
\[\min_\theta \mathbb{E} \left[ \left\| \epsilon-\epsilon_\theta(o_t,t,c) \right\|_2^2 \right]\]
  • A latent simulator objective instead predicts compact state:
\[\min_\theta \left\| \hat{z}_{t+1}-z_{t+1} \right\|_2^2\]
  • The distinction is not absolute. Many modern systems combine both: a latent world model predicts compressed future tokens, and a renderer decodes those tokens into pixels. GAIA-1: A Generative World Model for Autonomous Driving by Hu et al. (2023) explicitly separates an autoregressive token world model from a video diffusion decoder, using the first for high-level dynamics and the second for high-quality rendering.

Practical Implementation Pattern

  • A renderer world model usually contains the following components:

    • Tokenizer or autoencoder: compresses observations into latent tokens.
    • Generative backbone: U-Net, DiT, spatiotemporal transformer, or autoregressive transformer.
    • Conditioning interface: text, image, video prefix, camera path, action sequence, layout, or multimodal tokens.
    • Sampler: DDPM, DDIM, predictor-corrector, diffusion forcing, or autoregressive decoding.
    • Decoder: maps latent samples back to pixels or video frames.
  • A typical latent diffusion renderer training step is:

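import torch
import torch.nn.functional as F

# encoder, denoiser, optimizer, alpha, sigma, conditioning, and the sample_* helpers are placeholders.
x = sample_images_or_video()              # batch of pixel-space inputs
with torch.no_grad():                     # the stage-one autoencoder stays frozen
    z = encoder(x)
t = sample_noise_level()                  # noise-level index per batch element
eps = torch.randn_like(z)

# Forward noising; reshape schedule entries so they broadcast over latent dims.
shape = (-1,) + (1,) * (z.dim() - 1)
z_t = alpha[t].view(shape) * z + sigma[t].view(shape) * eps
eps_pred = denoiser(z_t, t, conditioning)

loss = F.mse_loss(eps_pred, eps)
optimizer.zero_grad()                     # clear gradients from the previous step
loss.backward()
optimizer.step()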
  • For video, the tensor shape changes from image latents to spatiotemporal latents \(z \in \mathbb{R}^{T \times H \times W \times C}\) and the model must decide whether to use full spatiotemporal attention, factorized space-time attention, frame stacking, temporal cross-attention, or causal autoregressive generation.

Strengths and Limitations

  • Renderer world models are strongest when the goal is visual imagination, content creation, synthetic data generation, or human-interpretable rollouts. They can generate high-fidelity scenes, interpolate missing frames, extend video, and produce visually rich counterfactuals.

  • Their limitations are equally important:

    • Visual realism does not guarantee physical correctness.
    • Pixel-level objectives can spend capacity on irrelevant details.
    • Long-horizon consistency remains difficult.
    • Action conditioning can drift if the model treats actions as weak visual prompts rather than causal interventions.
    • Generated observations may not expose editable, inspectable state.
  • This is why renderer world models are best treated as one branch of the world-model taxonomy rather than the whole field. They are essential for observation synthesis, but simulator and planner models are still required when the goal is reliable physical prediction or action selection.

Interactive Renderer World Models

  • Interactive renderer world models extend video generation from passive observation synthesis to action-conditioned visual worlds. Instead of generating a fixed clip from text or an image prompt, they repeatedly accept user or agent actions and render the next observation:

    \[\hat{o}_{t+1} \sim p_\theta(o_{t+1}\mid o_{\le t}, a_{\le t}, c)\]
    • where \(c\) may include a text prompt, image prompt, video context, task description, or domain-specific control signal. This makes the renderer appear simulator-like, because it reacts to actions, but its primary output is still observation-level video rather than an inspectable physical state.

Multimodal Driving Renderers

  • GAIA-1: A Generative World Model for Autonomous Driving by Hu et al. (2023) is a multimodal renderer-world-model hybrid for autonomous driving: it maps video, text, and action inputs into discrete tokens, predicts future tokens autoregressively, and decodes them into realistic driving videos with a diffusion decoder.

  • The architecture separates high-level dynamics from pixel rendering:

\[\text{video, text, action} \rightarrow \text{tokens} \rightarrow \text{autoregressive world model} \rightarrow \text{video diffusion decoder}\]
  • This split is important because driving requires both semantic control, such as traffic-light state or weather, and action control, such as ego speed or curvature. In renderer terms, GAIA-1 produces visually realistic future observations; in simulator terms, it partially models driving dynamics through token prediction.

  • The following figure (source) shows GAIA-1 generating driving videos under video, text, and action conditioning, including text-conditioned scene changes and ego-action-conditioned rollouts.

  • The following figure (source) shows the GAIA-1 architecture, where video, text, and action encoders produce tokens, an autoregressive transformer predicts future image tokens, and a video decoder renders the output frames.

Learned Latent Actions and Playable Worlds

  • Genie: Generative Interactive Environments by Bruce et al. (2024) introduces a foundation world model trained from unlabeled internet videos that can generate playable environments from prompts such as text-to-image outputs, sketches, and photographs.

  • Genie’s distinctive contribution is a learned latent action interface. Because most internet videos lack action labels, Genie infers a discrete latent action space that supports frame-by-frame interaction:

    \[\hat{o}_{t+1} \sim p_\theta(o_{t+1}\mid o_{\le t}, \hat{a}_t)\]
    • where \(\hat{a}_t\) is a learned latent action rather than a human-provided ground-truth control label. This makes it an important renderer-world-model design pattern: actions can be induced from video when explicit control annotations are unavailable.
  • The following figure (source) shows Genie converting text-to-image outputs, hand-drawn sketches, and real-world photos into interactive playable environments through a learned latent action interface.

Diffusion Renderers as Game Engines

  • Interactive game renderers make the renderer-simulator boundary especially sharp. A conventional game engine updates hidden state and renders pixels. A neural game renderer instead learns to generate the next frame directly from previous frames and actions.

  • Diffusion Models Are Real-Time Game Engines by Valevski et al. (2024) presents GameNGen, a neural game engine that simulates DOOM in real time by training a diffusion model to generate the next frame conditioned on previous frames and actions.

  • The learned transition is observation-level \(\hat{o}_{t+1}=D_\theta(o_{t-k:t},a_{t-k:t})\) rather than state-level \(\hat{s}_{t+1}=F_\theta(s_t,a_t)\).

  • This is why GameNGen is best categorized as an interactive renderer world model. It can appear to simulate rules, enemies, doors, health, and ammunition, but those variables are not necessarily exposed as editable symbolic state.

  • The following figure (source) shows GameNGen running DOOM at 20 FPS as an interactive neural game engine generated by a diffusion model conditioned on past frames and actions.

  • The following figure (source) gives an overview of the GameNGen method.

Diffusion World Models for Agent Training

  • Diffusion for World Modeling: Visual Details Matter in Atari by Alonso et al. (2024) introduces DIAMOND, a reinforcement-learning agent trained entirely inside a diffusion world model, arguing that preserving visual details can improve downstream control when small visual cues are task-relevant.

  • DIAMOND unrolls environment imagination autoregressively while running a denoising process at each step \(x_t^T \rightarrow x_t^{T-1} \rightarrow \dots \rightarrow x_t^0\) and then feeds the clean predicted observation into the next imagined transition. This makes it a renderer-planner bridge: the world model renders imagined observations, and the agent learns behavior inside those rendered trajectories.
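  • The two nested time axes can be made concrete with a toy loop. This is a structural sketch only, not DIAMOND's implementation: `toy_denoise_step` is a hypothetical stand-in for a learned denoiser, and the number of denoising steps and the history length are arbitrary choices.

```python
import numpy as np

def toy_denoise_step(x_noisy, tau, history, action):
    """Hypothetical stand-in for one reverse-diffusion step of a learned denoiser.

    It pulls the sample toward 'last frame + action' and lands on it at tau == 1.
    """
    target = history[-1] + action
    return x_noisy + (target - x_noisy) / tau

def imagine(first_frame, actions, num_denoise_steps=3, history_len=2, seed=0):
    rng = np.random.default_rng(seed)
    history = [np.asarray(first_frame, dtype=float)]
    for action in actions:                              # environment time t
        x = rng.normal(size=history[-1].shape)          # start from noise
        for tau in range(num_denoise_steps, 0, -1):     # denoising time
            x = toy_denoise_step(x, tau, history[-history_len:], action)
        history.append(x)                               # clean frame becomes context
    return np.stack(history)

frames = imagine(np.zeros(2), actions=[np.array([1.0, 0.0])] * 3)
print(frames)
```

  • The cost of imagination scales with the product of rollout length and denoising steps, so the number of denoising steps per frame is a direct lever on training throughput.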

  • The following figure (source) shows DIAMOND unrolling imagination over environment time while running denoising time vertically for each generated observation.

Real-Time Open-World Rendering

  • Oasis: A Universe in a Transformer presents a real-time interactive open-world renderer that takes keyboard input and generates a Minecraft-like experience with graphics, rules, and physics emerging from the model rather than a conventional physics engine.

  • Oasis is important because it highlights the latency constraint for renderer world models. A passive video generator can take seconds or minutes per clip; an interactive renderer must generate frames quickly enough to preserve the action-perception loop:

\[a_t \rightarrow \hat{o}_{t+1} \rightarrow a_{t+1}\]
  • The following figure (source) shows the architecture of Oasis, an experiential real-time open-world AI model that generates an interactive Minecraft-like video stream.

Commercial Text-to-Video Renderers

  • Large text-to-video systems are renderer world models when they generate temporally coherent observations conditioned on prompts. Runway Gen-3 Alpha: Next-Generation AI Video Generation describes a video foundation model trained for fidelity, consistency, and motion, with control modes for text-to-video, image-to-video, camera control, and temporal keyframing.

  • Mochi 1: A new SOTA in open text-to-video describes an open text-to-video model focused on high-fidelity motion and strong prompt adherence.

  • The following figure (source) shows the prompt adherence and motion quality results for Mochi 1 as an open text-to-video system.

Implementation Pattern for Interactive Renderers

  • Interactive renderer models usually add three components to a standard video generator:

    • Action encoder: maps keyboard inputs, robot actions, driving controls, or latent actions into tokens.
    • History window: conditions generation on recent frames to preserve temporal continuity.
    • Autoregressive rollout mechanism: feeds generated frames back as context for subsequent frames.
  • A generic training objective is:

\[\mathcal{L}= \mathbb{E}_{o,a,t,\epsilon} \left[ \left\| \epsilon-\epsilon_\theta(o_t^\tau,\tau,o_{<t},a_{\le t},c) \right\|_2^2 \right]\]
  • For token-based systems, the objective may instead be autoregressive next-token prediction:

    \[\mathcal{L}_{\text{AR}}= -\sum_t \log p_\theta(u_t\mid u_{<t},a_{\le t},c)\]
    • where \(u_t\) is a visual token. GAIA-1 uses this token-prediction pattern before decoding predicted tokens with a diffusion video decoder. GAIA-1 by Hu et al. (2023) is therefore a hybrid of autoregressive world modeling and diffusion rendering.
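  • The autoregressive objective reduces to a summed cross-entropy over visual tokens. The sketch below assumes the per-step logits have already been produced by some causal model that saw the earlier tokens, actions, and conditioning; the function name `autoregressive_nll` and the vocabulary size are illustrative.

```python
import numpy as np

def autoregressive_nll(logits, targets):
    """Sum over t of -log p(u_t | context), given per-step logits.

    logits: (T, V) scores over a vocabulary of V visual tokens; row t is assumed
    to come from a causal model conditioned on earlier tokens and actions.
    targets: (T,) integer ids of the observed tokens.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

uniform_logits = np.zeros((5, 8))          # a model with no information
targets = np.array([0, 3, 7, 1, 2])
loss = autoregressive_nll(uniform_logits, targets)
print(loss, 5 * np.log(8))
```

  • A uniform predictor pays log V per token, which gives a simple reference point when reading training curves for token-based world models.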

Failure Modes and Evaluation

  • Interactive renderer world models should be evaluated on more than visual quality. Key metrics include:

    • Action adherence: whether the rendered future reflects the control input.
    • Temporal stability: whether identities, layout, and object states persist over long rollouts.
    • Rule consistency: whether game or driving rules remain stable.
    • Recoverability: whether errors compound or self-correct.
    • Latency: whether generation is fast enough for closed-loop interaction.
    • Visual detail preservation: whether small task-relevant details survive compression.
  • This last point is important in control domains. Diffusion for World Modeling: Visual Details Matter in Atari by Alonso et al. (2024) argues that visual details lost by compact discrete latents can matter for reinforcement learning, motivating diffusion renderers as trainable environments.

Relationship to JEPA

  • Interactive renderer world models and JEPA world models optimize different contracts. A renderer predicts observations:
\[\hat{o}_{t+1}\sim p_\theta(o_{t+1}\mid o_{\le t},a_{\le t})\]
  • A JEPA predicts latent state:
\[\hat{z}_{t+1}=g_\phi(z_t,a_t)\]
  • The renderer is easier to inspect because it outputs pixels. JEPA is often more efficient for planning because it avoids rendering irrelevant detail. A mature world-model stack may combine both: a JEPA-style latent simulator for compact planning and a renderer module for visualization, data generation, or human-facing interaction.

Design Trade-offs and Evaluation

Observation Fidelity versus State Fidelity

  • Renderer world models optimize the quality of generated observations. Their natural objective is to produce frames or videos that are visually plausible, temporally coherent, and aligned with conditioning:
\[\hat{o}_{t:t+H} \sim p_\theta(o_{t:t+H}\mid o_{\le t},a_{\le t},c)\]
  • This makes them powerful for imagination, content creation, synthetic data generation, and interactive visual environments. Video Diffusion Models by Ho et al. (2022) shows that diffusion models can generate coherent video by extending image diffusion architectures to spatiotemporal data, while GAIA-1: A Generative World Model for Autonomous Driving by Hu et al. (2023) shows that video, text, and action conditioning can be combined to render controllable driving futures.

  • The central limitation is that observation fidelity does not imply state fidelity. A renderer can generate a realistic-looking scene while failing to maintain the exact hidden state required for physics, robotics, or safety-critical planning. In a simulator-first model, the primary object is instead a state transition:

\[\hat{z}_{t+1}\sim p_\theta(z_{t+1}\mid z_t,a_t)\]
  • The distinction matters because an action-conditioned renderer may appear to simulate the world, but its internal state may remain implicit, distributed, and difficult to inspect. Diffusion Models Are Real-Time Game Engines by Valevski et al. (2024) demonstrates that a diffusion model can render an interactive DOOM-like game stream in real time, but the learned engine is still observation-output-first rather than an explicit symbolic or physical simulator.

Long-Horizon Consistency

  • Long-horizon consistency is the main technical challenge for renderer world models. In autoregressive video rendering, each generated frame becomes part of the conditioning context for later frames:
\[\hat{o}_{t+1}=D_\theta(o_{t-k:t},a_{t-k:t})\]
  • Small visual or semantic errors can compound over time, causing identity drift, geometry drift, texture instability, or inconsistent object state. Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion by Chen et al. (2024) addresses this by assigning independent noise levels to sequence tokens, combining the variable-horizon flexibility of next-token prediction with the trajectory guidance benefits of full-sequence diffusion.
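  • The idea of independent per-token noise levels can be sketched as follows. This shows only the noising side of training, under an assumed cosine-style schedule; the helper name `noise_sequence_independently` and the schedule are illustrative and are not taken from the Diffusion Forcing code.

```python
import numpy as np

def noise_sequence_independently(z, alpha, sigma, rng):
    """z: (T, D) latent tokens. Draw one noise level per token, independently."""
    T = z.shape[0]
    levels = rng.integers(0, len(alpha), size=T)     # a separate level for each token
    eps = rng.normal(size=z.shape)
    z_noisy = alpha[levels, None] * z + sigma[levels, None] * eps
    return z_noisy, levels, eps

K = 10                                               # number of discrete noise levels (assumed)
alpha = np.cos(np.linspace(0.0, np.pi / 2, K))       # level 0 is clean
sigma = np.sin(np.linspace(0.0, np.pi / 2, K))       # level K-1 is pure noise
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))
z_noisy, levels, eps = noise_sequence_independently(z, alpha, sigma, rng)
print(levels)
```

  • A full-sequence diffusion model would instead share one level across all tokens, and a teacher-forced next-token model corresponds to clean past tokens with a fully noised next token; drawing levels independently covers both regimes and the cases in between.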

  • For interactive settings, long-horizon consistency is not merely an aesthetic property. It determines whether the model can preserve an agent’s location, inventory, road context, collision history, or object permanence across repeated user interventions. Genie: Generative Interactive Environments by Bruce et al. (2024) is important here because it learns a latent action interface from unlabeled internet video, allowing generated environments to be stepped through frame by frame.

Action Adherence and Controllability

  • Renderer world models become world-model-like when they respond reliably to actions. A text-to-video model can render plausible motion, but an interactive renderer must bind controls to consequences:
\[a_t \rightarrow \hat{o}_{t+1}\]
  • This requires the model to distinguish between visual correlation and causal control. GAIA-1 by Hu et al. (2023) conditions driving rollouts on ego-vehicle speed and curvature, making control adherence a core part of the generated future. Oasis: A Universe in a Transformer presents an interactive open-world model where keyboard inputs are mapped directly into a generated Minecraft-like visual stream.

  • A practical renderer evaluation should therefore measure whether the action affects the correct visual variables. Steering should change ego trajectory, jumping should change camera height and scene dynamics, breaking an object should persist in later frames, and a traffic-light edit should remain stable across the rollout.

Visual Detail as a Control Signal

  • One reason renderer world models remain relevant for control is that visual details can matter. Compact latent models may discard small cues that are irrelevant for reconstruction metrics but crucial for reward or safety. Diffusion for World Modeling: Visual Details Matter in Atari by Alonso et al. (2024) argues that diffusion world models can improve agent training when task-relevant details would otherwise be lost through overly compressed discrete latents.

  • This point creates a nuanced trade-off. JEPA-style latent simulators avoid wasting capacity on unpredictable detail, but renderer models may preserve low-level signals that matter for certain tasks. A pedestrian far away, a small traffic light, a projectile, a door handle, or an inventory icon may be visually small but decision-critical.

Evaluation Criteria for Renderer World Models

  • Renderer world models should be evaluated against a broad set of criteria:

    • Visual fidelity: frame-level realism and perceptual quality.
    • Temporal coherence: identity, layout, and object-state persistence.
    • Controllability: whether text, camera, action, or layout conditions reliably affect the intended variables.
    • Long-horizon stability: compounding error across autoregressive rollouts.
    • Physical plausibility: whether motion, contact, and geometry remain believable.
    • Task utility: whether generated observations help downstream agents, planners, or users.

  • This broader evaluation lens is necessary because classic image and video metrics alone are insufficient. A model can score well on perceptual realism yet fail as a world model if it ignores actions, violates persistence, or produces futures that are visually plausible but causally wrong.

Relationship to JEPA

  • Renderer world models and JEPA-style world models occupy complementary positions. Renderer models optimize observation generation:
\[\mathcal{L}_{\text{renderer}} =\mathbb{E} \left[ \left\| \epsilon-\epsilon_\theta(o_t^\tau,\tau,c) \right\|_2^2 \right]\]
  • JEPA-style models optimize latent prediction:
\[\mathcal{L}_{\text{JEPA}} =\left\| g_\phi(f_\theta(o_{\le t}),a_t)-f_{\bar{\theta}}(o_{t+1}) \right\|_2^2\]
  • The renderer objective is useful when the system must show or synthesize the world. The JEPA objective is useful when the system must reason over compact predictable structure. A mature world-model stack may use both: a renderer for visualization and synthetic experience, a simulator-style latent model for efficient prediction, and a planner for goal-directed action selection.
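  • The JEPA objective can be sketched with linear maps. This is a toy under stated assumptions: the encoder sees only the current observation rather than the full history, all maps are linear, and the target encoder \(f_{\bar{\theta}}\) is assumed to be an exponential moving average of the online encoder, which is a common choice but is not specified above. NumPy has no autograd, so the stop-gradient on the target branch appears only as a comment.

```python
import numpy as np

def jepa_loss(obs_t, action, obs_next, enc_w, target_w, pred_w):
    """Squared latent prediction error for one transition (linear toy model)."""
    z_t = obs_t @ enc_w                                # online encoder f_theta
    z_pred = np.concatenate([z_t, action]) @ pred_w    # predictor g_phi(z_t, a_t)
    z_target = obs_next @ target_w                     # target encoder; no gradient flows here
    return float(np.sum((z_pred - z_target) ** 2))

def ema_update(target_w, enc_w, decay=0.99):
    """Assumed target-encoder update: slow moving average of the online encoder."""
    return decay * target_w + (1.0 - decay) * enc_w

enc_w = np.eye(2)
target_w = np.eye(2)
pred_w = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # toy predictor that ignores the action
loss = jepa_loss(np.array([1.0, 2.0]), np.array([0.5]),
                 np.array([2.0, 2.0]), enc_w, target_w, pred_w)
print(loss)
```

  • No pixel decoder appears anywhere in the loss, which is the practical difference from the renderer objective: the error is measured entirely in latent space.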

Simulator World Models

Neural Scene and Spatial State Representations

  • Simulator world models are systems whose primary output is state rather than raw observation. A renderer asks what the world should look like; a simulator asks what the world is and how that state can be queried, transformed, rendered, or rolled forward. In spatial domains, this state may be a continuous radiance field, a Gaussian-splat scene, a mesh, a point cloud, a signed-distance field, an object layout, or a hybrid neural representation.

  • A spatial simulator can be written as:

    \[z = \mathcal{S}_\theta(o_{1:N}, c_{1:N})\]
    • where \(o_{1:N}\) are observations and \(c_{1:N}\) are camera poses, calibration parameters, or conditioning inputs. Once learned, the simulator exposes a state representation \(z\) that can be rendered from new viewpoints, edited, optimized, or combined with downstream planning systems.

Neural Radiance Fields as Continuous Scene Simulators

  • NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis by Mildenhall et al. (2020) represents a scene as a continuous function that maps 3D position and viewing direction to density and color, making it a foundational neural scene simulator for novel-view synthesis.

    \[F_\Theta:(x,y,z,\theta,\phi)\rightarrow(\sigma,c)\]
    • where \(\sigma\) is volume density and \(c\) is view-dependent emitted radiance. Rendering is performed by sampling points along camera rays and integrating color through differentiable volume rendering:
\[C(r)=\int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t),d)\,dt\] \[T(t)=\exp\left(-\int_{t_n}^{t}\sigma(r(s))\,ds\right)\]
  • This makes NeRF simulator-like because it does not merely generate a single image. It learns a continuous 3D scene function that can be queried from novel camera viewpoints. The implementation depends on camera-calibrated image collections, positional encoding for high-frequency details, hierarchical ray sampling, and optimization through photometric reconstruction loss. NeRF by Mildenhall et al. (2020) introduced positional encoding and differentiable volume rendering as practical tools for optimizing photorealistic neural scene representations from posed RGB images.
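  • In practice the volume rendering integral is approximated by a sum over samples along each ray. The sketch below uses the standard quadrature, with per-sample opacity \(1-\exp(-\sigma_i\delta_i)\) and transmittance given by the running product of earlier transparencies. It assumes densities and colors have already been queried from the field; `render_ray` and the bin-edge parameterization `t_vals` are illustrative choices.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Discrete volume rendering along one ray.

    sigmas: (N,) densities, colors: (N, 3) radiance values,
    t_vals: (N + 1,) bin edges along the ray, so deltas has length N.
    """
    deltas = np.diff(t_vals)
    alphas = 1.0 - np.exp(-sigmas * deltas)               # opacity of each segment
    # T_i = product over j < i of (1 - alpha_j): light surviving to sample i.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    return weights @ colors, weights

t_vals = np.array([0.0, 1.0, 2.0, 3.0])
colors = np.eye(3)                                        # red, green, blue samples
sigmas = np.array([0.0, 1e6, 5.0])                        # empty, opaque wall, occluded
color, weights = render_ray(sigmas, colors, t_vals)
print(color, weights)
```

  • The opaque second sample absorbs all the transmittance, so the third sample contributes nothing even though its density is nonzero. Every operation is differentiable in the densities and colors, which is what allows a photometric loss to train the field.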

  • The following figure (source) shows the NeRF pipeline, where camera rays are sampled through a continuous 5D radiance field and accumulated with differentiable volume rendering to synthesize novel views. Volume rendering techniques accumulate samples of the scene representation along rays so that the scene can be rendered from any viewpoint. The figure visualizes the set of 100 input views of the synthetic Drums scene, randomly captured on a surrounding hemisphere, together with two novel views rendered from the optimized NeRF representation.

  • The following figure (source) shows an overview of the NeRF scene representation and differentiable rendering procedure. Images are synthesized by sampling 5D coordinates (location and viewing direction) along camera rays (a), feeding those locations into an MLP to produce a color and volume density (b), and using volume rendering techniques to composite these values into an image (c). This rendering function is differentiable, so the scene representation can be optimized by minimizing the residual between synthesized and ground-truth observed images (d).

Explicit Neural Scene State through 3D Gaussian Splatting

  • 3D Gaussian Splatting for Real-Time Radiance Field Rendering by Kerbl et al. (2023) moves neural scene simulation toward an explicit point-based representation by modeling scenes as anisotropic 3D Gaussians optimized from structure-from-motion points.

  • Each Gaussian has position, opacity, covariance, and view-dependent color parameters:

\[G_i = \{\mu_i,\Sigma_i,\alpha_i,c_i\}\]
  • Rendering projects Gaussians into screen space and alpha-composites them in visibility order:

    \[C=\sum_i T_i \alpha_i c_i\]
    • where \(T_i\) is accumulated transmittance from earlier splats. This representation is simulator-like because the scene is no longer hidden entirely inside an MLP; it is stored as an editable set of spatial primitives that can be rendered in real time. 3D Gaussian Splatting for Real-Time Radiance Field Rendering by Kerbl et al. (2023) reports real-time rendering at 1080p using anisotropic 3D Gaussians, adaptive density control, and a visibility-aware tile rasterizer.
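  • The compositing sum can be sketched for a single pixel. This is a simplified illustration, not the tile rasterizer from the paper: it assumes each splat's opacity has already been evaluated at this pixel (learned opacity times the projected Gaussian falloff), and the early-exit threshold `min_transmittance` is an assumed value. The function name `composite_splats` is hypothetical.

```python
import numpy as np

def composite_splats(depths, alphas, colors, min_transmittance=1e-4):
    """Front-to-back alpha compositing of splats covering one pixel."""
    order = np.argsort(depths)                 # nearest splat first
    pixel = np.zeros(colors.shape[1])
    transmittance = 1.0
    for i in order:
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
        if transmittance < min_transmittance:  # pixel is effectively opaque
            break
    return pixel, transmittance

depths = np.array([2.0, 1.0])                  # the second splat is closer
alphas = np.array([0.5, 0.5])
colors = np.array([[1.0, 0.0, 0.0],            # far splat: red
                   [0.0, 0.0, 1.0]])           # near splat: blue
pixel, remaining = composite_splats(depths, alphas, colors)
print(pixel, remaining)
```

  • Because the primitives are explicit, editing the scene amounts to editing these arrays: removing or moving a splat changes the rendered pixel without retraining a network.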
  • The following figure (source) shows 3D Gaussian Splatting achieving real-time rendering quality competitive with prior radiance-field methods while reducing optimization and rendering cost.

  • The following figure (source) shows the 3D Gaussian Splatting optimization pipeline, where sparse SfM points initialize Gaussians, adaptive density control refines the representation, and a differentiable tile rasterizer provides gradients for optimization.

  • An implementation typically starts from calibrated images and a sparse structure-from-motion point cloud, initializes Gaussians at point locations, optimizes position, covariance, opacity, and spherical-harmonic color coefficients, adapts Gaussian density by cloning, splitting, or pruning primitives, and minimizes a photometric loss such as:
\[\mathcal{L}=(1-\lambda)\mathcal{L}_1+\lambda\mathcal{L}_{\text{D-SSIM}}\]
  • The engineering significance is that 3D Gaussian Splatting narrows the gap between neural world representations and interactive simulators: the learned state is explicit enough for fast rendering, yet continuous enough to retain radiance-field quality.

Text-to-3D as Generative Spatial Simulation

  • DreamFusion: Text-to-3D using 2D Diffusion by Poole et al. (2022) uses a pretrained 2D text-to-image diffusion model as a prior for optimizing a 3D representation, showing that a renderer-style generative model can supervise a simulator-style 3D state without requiring large-scale labeled 3D data.

  • DreamFusion optimizes a randomly initialized NeRF so that random renderings of the 3D object are scored as likely by a text-conditioned diffusion model. Its core training mechanism is score distillation sampling:

    \[\nabla_\theta \mathcal{L}_{\text{SDS}}(\theta) =\mathbb{E}_{t,\epsilon} \left[ w(t) \left( \epsilon_\phi(x_t;y,t)-\epsilon \right) \frac{\partial x}{\partial \theta} \right]\]
    • where \(\epsilon_\phi\) is the pretrained diffusion denoiser, \(y\) is the text prompt, and \(x\) is a rendered image of the current 3D representation. DreamFusion by Poole et al. (2022) is important for simulator world models because it converts a 2D generative prior into an optimizable 3D scene state that can be viewed from arbitrary angles.
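  • A toy sketch of the score distillation update may make the gradient concrete. It is not DreamFusion's implementation: the "renderer" is the identity map (so \(\partial x/\partial\theta\) is the identity), the noise schedule and the weighting \(w(t)=\sigma_t^2\) are assumptions, and `toy_denoiser` is a hypothetical stand-in for the frozen diffusion model, namely the exact noise predictor for a prior concentrated on one target image.

```python
import math
import random


def toy_denoiser(x_t, alpha, sigma, target):
    # Hypothetical stand-in for the frozen text-conditioned denoiser: the exact
    # noise predictor when the image prior puts all its mass on `target`.
    return [(xt - alpha * m) / sigma for xt, m in zip(x_t, target)]


def sds_step(theta, target, lr, rng):
    t = rng.uniform(0.02, 0.98)              # random diffusion time
    alpha = math.cos(0.5 * math.pi * t)      # signal scale (assumed schedule)
    sigma = math.sin(0.5 * math.pi * t)      # noise scale (assumed schedule)
    x = list(theta)                          # identity "renderer": dx/dtheta = I
    eps = [rng.gauss(0.0, 1.0) for _ in x]   # injected noise
    x_t = [alpha * xi + sigma * e for xi, e in zip(x, eps)]
    eps_hat = toy_denoiser(x_t, alpha, sigma, target)
    w = sigma ** 2                           # weighting w(t), chosen for illustration
    # SDS gradient: w(t) * (eps_hat - eps) * dx/dtheta, with no backprop
    # through the denoiser itself.
    grad = [w * (eh - e) for eh, e in zip(eps_hat, eps)]
    return [th - lr * g for th, g in zip(theta, grad)]


rng = random.Random(0)
target = [0.2, 0.5, 0.9]   # the only image the toy prior considers likely
theta = [0.0, 0.0, 0.0]    # "scene parameters", here just three pixel values
for _ in range(300):
    theta = sds_step(theta, target, lr=0.5, rng=rng)
```

    • In this toy setting the residual \(\hat{\epsilon}-\epsilon\) reduces to a scaled copy of \(x-\text{target}\), so the update pulls the rendered pixels toward what the prior considers likely. With a real diffusion model the same residual is pushed back through a differentiable renderer into the 3D parameters.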
  • The following figure (source) shows the DreamFusion optimization loop, where a text-conditioned diffusion prior supplies gradients to a rendered view of a 3D model, gradually shaping a coherent text-conditioned 3D representation. Specifically, DreamFusion generates 3D objects from a natural language caption such as “a DSLR photo of a peacock on a surfboard.” The scene is represented by a Neural Radiance Field that is randomly initialized and trained from scratch for each caption. The NeRF parameterizes volumetric density and albedo (color) with an MLP. The NeRF is rendered from a random camera, using normals computed from gradients of the density to shade the scene with a random lighting direction. Shading reveals geometric details that are ambiguous from a single viewpoint. To compute parameter updates, DreamFusion diffuses the rendering and reconstructs it with a (frozen) conditional Imagen model to predict the injected noise \(\hat{\epsilon}_\phi\left(\mathbf{z}_t \mid y ; t\right)\). This prediction contains structure that should improve fidelity, but it is high variance. Subtracting the injected noise produces a low-variance update direction, \(\text{stopgrad}\left[\hat{\epsilon}_\phi-\epsilon\right]\), that is backpropagated through the rendering process to update the NeRF MLP parameters.

  • This is a bridge between renderer and simulator paradigms. The supervising model is a renderer because it evaluates images, but the optimized object is a simulator-compatible 3D state. This pattern has become central to text-to-3D, embodied simulation, game asset generation, and spatial AI pipelines.

Visual De-animation and Inverse Graphics as Simulator Construction

  • Learning to See Physics via Visual De-animation by Wu et al. (2017) frames scene understanding as recovering a physical world representation from visual input, then using physics and graphics engines to reason forward and render predicted outcomes.

  • The model decomposes visual understanding into inverse graphics, physical-state estimation, forward simulation, and rendering:

\[o_{1:T}\rightarrow z_{\text{phys}}\rightarrow \hat{z}_{T+1:T+H}\rightarrow \hat{o}_{T+1:T+H}\]
  • This is simulator-first because the core latent object is physical state: positions, velocities, masses, friction, shape, viewpoint, and scene layout. The perception module infers this state, the physics engine rolls it forward, and the graphics engine renders outcomes for reconstruction or prediction. Learning to See Physics via Visual De-animation by Wu et al. (2017) shows how inverse graphics and physics simulation can be combined so that visual prediction is mediated by an interpretable physical representation.

  • The following figure (source) shows visual de-animation, where the system recovers the physical world representation behind visual input and combines it with physics simulation and rendering engines.

  • The following figure (source) shows the visual de-animation framework, including inverse graphics, physical state recovery, physics-based future prediction, and rendering-based reconstruction. Specifically, the visual de-animation (VDA) model contains three major components: a convolutional perception module (I), a physics engine (II), and a graphics engine (III). The perception module efficiently inverts the graphics engine by inferring the physical object state for each segment proposal in input (a), and combines them to obtain a physical world representation (b). The generative physics and graphics engines then run forward to reconstruct the visual data (e).

  • A practical version of this pipeline uses an object detector or proposal generator, a neural network for object and physical-parameter inference, a differentiable or non-differentiable physics engine, and a rendering loss that forces the inferred state to explain observations. In differentiable settings, gradients can flow from image reconstruction through rendering into physical-state estimates. In non-differentiable settings, state inference may rely on learned approximations, search, or surrogate gradients.

Implementation Pattern for Spatial Simulator World Models

  • A spatial simulator world model generally follows a four-step implementation pattern: it first defines a state representation such as a radiance field, Gaussian splat field, mesh, point cloud, or object-physical state; it then defines a differentiable or approximately differentiable renderer that maps state to observations; it optimizes the state or neural parameters against observed images, videos, poses, or text-conditioned priors; and it exposes the learned state for novel-view rendering, editing, simulation, or downstream planning.

  • The generic objective is:

    \[\mathcal{L}_{\text{sim-state}} =\mathcal{L}_{\text{render}}(R_\psi(z),o) +\lambda\mathcal{L}_{\text{state-prior}}(z) +\gamma\mathcal{L}_{\text{consistency}}(z)\]
    • where \(R_\psi\) is a renderer, \(z\) is the learned scene state, and the regularization terms encode priors such as smoothness, sparsity, geometric consistency, multi-view consistency, or physical plausibility.

Learned Physical Dynamics and Relational Simulation

From Static Scene State to Dynamic State Transition

  • A spatial scene representation becomes a simulator when it can predict how the state evolves under time, forces, contacts, constraints, or actions. The core simulator equation is:

    \[\hat{z}_{t+1}=F_\theta(z_t,a_t)\]
    • where \(z_t\) is the current world state and \(a_t\) may represent an external action, control input, force, boundary condition, or intervention. In a physical simulator, the state must preserve quantities that support prediction, such as position, velocity, mass, material properties, object identity, contact state, and relation structure.
  • Learned physical simulators differ from renderer world models because they do not primarily optimize for image realism. Their target is state fidelity: the predicted future state should obey the learned dynamics of interacting entities, fluids, cloth, rigid bodies, deformable materials, or meshes.

Interaction Networks and Object-Relation Simulation

  • Interaction Networks for Learning about Objects, Relations and Physics by Battaglia et al. (2016) introduced a neural framework for reasoning over objects and relations, where the model takes object states and relation attributes as input, computes interaction effects, and applies learned object dynamics to predict future states.

  • The state is naturally graph-structured:

    \[G_t=(V_t,E_t)\]
    • where each node \(v_i \in V_t\) represents an object and each edge \(e_{ij}\in E_t\) represents a relation or interaction. The model computes relation effects and aggregates them into object updates:
\[e_{ij}'=\phi_e(v_i,v_j,e_{ij})\] \[\bar{e}_i=\sum_{j} e_{ij}'\] \[v_i'=\phi_v(v_i,\bar{e}_i)\]
  • This architecture is simulator-like because it mirrors the compositional structure of physical systems: objects interact through relations, and future state emerges from those interactions. Interaction Networks by Battaglia et al. (2016) showed that object-relation neural computation can simulate n-body systems, rigid-body collisions, and non-rigid dynamics while generalizing across different object configurations.
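  • The three update equations can be sketched as one message-passing step. This is a minimal illustration under stated assumptions: in an interaction network \(\phi_e\) and \(\phi_v\) are learned networks, whereas here a hand-written spring force and an Euler update stand in for them, and the dictionary-based node and edge layout is hypothetical.

```python
def phi_e(v_i, v_j, e_ij):
    # Relation effect on receiver i from sender j. A hand-written spring
    # force stands in for the learned relation network.
    stiffness, rest_length = e_ij
    dx = v_j["x"] - v_i["x"]
    stretch = abs(dx) - rest_length
    direction = 1.0 if dx > 0 else -1.0
    return stiffness * stretch * direction


def phi_v(v_i, effect, dt=0.1):
    # Object update. A semi-implicit Euler step stands in for the learned
    # object network.
    vel = v_i["vel"] + dt * effect / v_i["mass"]
    return {"x": v_i["x"] + dt * vel, "vel": vel, "mass": v_i["mass"]}


def interaction_step(nodes, edges):
    # edges maps (receiver i, sender j) -> relation attributes e_ij.
    aggregated = [0.0] * len(nodes)
    for (i, j), e_ij in edges.items():
        aggregated[i] += phi_e(nodes[i], nodes[j], e_ij)  # sum of effects on i
    return [phi_v(v, e_bar) for v, e_bar in zip(nodes, aggregated)]


# Two unit masses joined by a stretched spring (rest length 1, separation 2).
nodes = [{"x": 0.0, "vel": 0.0, "mass": 1.0}, {"x": 2.0, "vel": 0.0, "mass": 1.0}]
edges = {(0, 1): (1.0, 1.0), (1, 0): (1.0, 1.0)}
next_nodes = interaction_step(nodes, edges)
```

    • The same `interaction_step` works for any number of objects and any edge set, which is the property that lets a trained model generalize across object configurations.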

  • The following figure (source) shows an interaction network, where objects and relations are encoded, interaction effects are computed, and object dynamics are applied to produce physical predictions. Specifically: a. For physical reasoning, the model takes objects and relations as input, reasons about their interactions, and applies the effects and physical dynamics to predict new states. b. For more complex systems, the model takes as input a graph that represents a system of objects, \(o_j\), and relations, \(\left\langle i, j, r_k\right\rangle_k\), instantiates the pairwise interaction terms, \(b_k\), and computes their effects, \(e_k\), via a relational model, \(f_R(\cdot)\). The \(e_k\) are then aggregated and combined with the \(o_j\) and external effects, \(x_j\), to generate input (as \(c_j\)), for an object model, \(f_O(\cdot)\), which predicts how the interactions and dynamics influence the objects, \(p\).

Visual Interaction Networks and Simulation from Video

  • Visual Interaction Networks: Learning a Physics Simulator from Video by Watters et al. (2017) extends object-relation simulation to raw visual input by using a perceptual front-end to infer latent object states and an interaction network to roll those states forward.

  • The model can be summarized as:

    \[o_{1:k}\rightarrow \{z_k^{(1)},z_k^{(2)},\dots,z_k^{(N)}\}\rightarrow \hat{z}_{k+1:k+H}\]
    • where the first stage parses visual evidence into object-centric latent states and the second stage performs relational dynamics prediction. Visual Interaction Networks by Watters et al. (2017) is important because it connects perception to learned simulation: the model predicts physical trajectories from video rather than requiring direct access to simulator state.
  • The following figure (source) shows the Visual Interaction Network architecture, where a convolutional perceptual front-end infers object states from video and an interaction network predicts future physical trajectories.

  • This pattern remains central to simulator world models. A perception module constructs latent state, a relational dynamics module rolls state forward, and a decoder or evaluator compares predictions to future observations or ground-truth state. Unlike pure video renderers, the goal is not only to synthesize plausible frames; it is to preserve the latent variables that govern future physical behavior.

Graph Network-Based Simulators

  • Learning to Simulate Complex Physics with Graph Networks by Sanchez-Gonzalez et al. (2020) generalizes learned simulation to particle-based physical systems by representing particles as graph nodes and computing dynamics through message passing.

  • A Graph Network-based Simulator represents each particle or material element as a node:

\[v_i^t = [x_i^t,\dot{x}_i^t,m_i,\text{material}_i,\dots]\]
  • Edges connect nearby particles or interacting elements:
\[e_{ij}^t = [x_i^t-x_j^t,\|x_i^t-x_j^t\|,\dots]\]
  • Message passing then computes local interaction effects, aggregates them, and predicts accelerations or position updates:
\[m_{ij}=\phi_e(v_i,v_j,e_{ij})\] \[\bar{m}_i=\sum_{j\in \mathcal{N}(i)}m_{ij}\] \[\Delta v_i=\phi_v(v_i,\bar{m}_i)\]
  • The simulator is rolled out autoregressively:
\[\hat{z}_{t+1}=F_\theta(\hat{z}_t)\]
  • Learning to Simulate Complex Physics with Graph Networks by Sanchez-Gonzalez et al. (2020) demonstrated learned simulation across fluids, rigid solids, and deformable materials, and found that noise corruption during training improves robustness to rollout error.
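  • The graph-construction step can be sketched as follows: particles within a connectivity radius are linked, and each directed edge stores the relative displacement and distance from the edge-feature definition above. The function name `build_radius_graph`, the 2D positions, and the radius are illustrative assumptions, and the all-pairs loop is chosen for clarity rather than speed.

```python
import math


def build_radius_graph(positions, radius):
    """Return {(i, j): [dx, dy, distance]} for all particle pairs within radius."""
    edges = {}
    for i, (xi, yi) in enumerate(positions):
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue  # no self-edges
            dx, dy = xi - xj, yi - yj
            dist = math.hypot(dx, dy)
            if dist <= radius:
                edges[(i, j)] = [dx, dy, dist]
    return edges


# Hypothetical example: two nearby particles and one far away.
positions = [(0.0, 0.0), (0.3, 0.4), (2.0, 2.0)]
edges = build_radius_graph(positions, radius=1.0)
```

    • Because edge features are relative rather than absolute, the learned interaction function sees the same input wherever a particle pair sits in the domain. The graph is rebuilt at every rollout step as particles move.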

  • The following figure (source) shows the Graph Network-based Simulator framework, where particle states are represented as graph nodes and learned message passing predicts physical evolution. Specifically: (a) The GNS predicts future states represented as particles using its learned dynamics model, \(d_\theta\), and a fixed update procedure. (b) The \(d_\theta\) uses an “encode-process-decode” scheme, which computes dynamics information, \(Y\), from input state, \(X\). (c) The encoder constructs latent graph, \(G^0\), from the input state, \(X\). (d) The processor performs \(M\) rounds of learned message-passing over the latent graphs, \(G^0, \ldots, G^M\). (e) The decoder extracts dynamics information, \(Y\), from the final latent graph, \(G^M\).

MeshGraphNets and Scientific Simulation

  • Learning Mesh-Based Simulation with Graph Networks by Pfaff et al. (2020) extends graph simulation to mesh-based physical systems, using graph neural networks over adaptive meshes for domains such as aerodynamics, structural mechanics, and cloth simulation.

  • Mesh simulators differ from particle simulators because the graph structure is not only a set of local neighbors; it is a discretization of an underlying physical domain. Nodes represent mesh vertices, edges represent mesh connectivity, and attributes encode geometry, boundary conditions, material state, and dynamic quantities.

\[G_t=(V_t,E_{\text{mesh}},E_{\text{world}})\]
  • MeshGraphNets uses both mesh edges and world-space proximity edges, allowing the model to combine discretization-aware local computation with interaction between nearby physical elements. Learning Mesh-Based Simulation with Graph Networks by Pfaff et al. (2020) reports accurate learned rollouts across complex systems and notes that learned mesh simulators can run substantially faster than the numerical solvers used to generate their training data.

  • The following figure (source) shows MeshGraphNets (operating on their SphereDynamic domain), where simulation state is encoded on a mesh graph, processed through message passing, and decoded into updated physical quantities. The model uses an Encode-Process-Decode architecture trained with one-step supervision, and can be applied iteratively to generate long trajectories at inference time. The encoder transforms the input mesh \(M^t\) into a graph, adding extra world-space edges. The processor performs several rounds of message passing along mesh edges and world edges, updating all node and edge embeddings. The decoder extracts the acceleration for each node, which is used to update the mesh to produce \(M^{t+1}\).

  • Mesh-based learned simulators are particularly important for engineering because mesh resolution can adapt to regions requiring precision, such as boundary layers in fluid flow, contact regions in cloth, and stress concentrations in deformable structures. This makes them a natural simulator-world-model paradigm for scientific and industrial domains.

Implementation Pattern for Learned Physical Simulators

  • A learned physical simulator generally follows a multi-stage implementation pattern: choose a state representation that exposes relevant physical variables, construct a graph from objects, particles, or mesh elements, encode node and edge attributes with neural networks, perform multiple message-passing steps to approximate local interactions, decode accelerations or state deltas, integrate the predicted dynamics through time, and train on one-step or multi-step prediction losses while adding noise or rollout training to reduce compounding error.

  • A generic one-step objective is:

\[\mathcal{L}_{\text{1-step}} =\left\| F_\theta(z_t,a_t)-z_{t+1} \right\|_2^2\]
  • A rollout objective is:

    \[\mathcal{L}_{\text{rollout}} =\sum_{k=1}^{H} \left\| \hat{z}_{t+k}-z_{t+k} \right\|_2^2\]
    • where:
    \[\hat{z}_{t+k+1}=F_\theta(\hat{z}_{t+k},a_{t+k})\]
  • The rollout loss is more expensive but better aligned with simulator use, because downstream planners care about long-horizon accuracy rather than isolated one-step predictions.
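  • The gap between the two objectives can be seen in a scalar toy example. The sketch below assumes a hypothetical "learned" model whose gain is off by 2%; the one-step loss always restarts from the true state, while the rollout loss feeds the model its own predictions.

```python
def rollout(step_fn, z0, actions):
    # Apply step_fn autoregressively and collect the visited states.
    states, z = [], z0
    for a in actions:
        z = step_fn(z, a)
        states.append(z)
    return states


def true_step(z, a):
    return 1.00 * z + a  # ground-truth dynamics (toy)


def model_step(z, a):
    return 1.02 * z + a  # hypothetical learned model with a small gain error


actions = [0.0] * 20
truth = rollout(true_step, 1.0, actions)

# One-step loss: every prediction starts from the true previous state.
prev_truth = [1.0] + truth[:-1]
one_step_loss = sum(
    (model_step(z, a) - z_next) ** 2
    for z, a, z_next in zip(prev_truth, actions, truth)
)

# Rollout loss: the model consumes its own predictions, so errors compound.
predicted = rollout(model_step, 1.0, actions)
rollout_loss = sum((p - z) ** 2 for p, z in zip(predicted, truth))
```

    • Here every one-step error is the same small constant, yet the rollout error grows geometrically with the horizon, which is the behavior a planner actually experiences.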

Error Accumulation and Stabilization

  • Learned simulators face the same compounding-error problem as interactive renderers, but state-space errors are often more consequential. A small velocity error can lead to large position drift; a small contact error can change the outcome of a collision; a small pressure error can destabilize a fluid rollout.

  • Practical stabilization methods include training with corrupted inputs so the model learns to recover from off-manifold states, adding multi-step rollout losses so the model is exposed to its own predictions, enforcing conservation-inspired constraints when known, normalizing state variables and edge features to improve optimization, and using graph locality to preserve physical inductive bias. Learning to Simulate Complex Physics with Graph Networks by Sanchez-Gonzalez et al. (2020) identifies message-passing depth and training-time noise corruption as major determinants of long-term simulation quality.

Relationship to Renderer Models and JEPA

  • Learned physical simulators differ from renderer world models in the object they optimize. Renderers produce observations:
\[\hat{o}_{t+1}\sim p_\theta(o_{t+1}\mid o_{\le t},a_{\le t})\]
  • Simulators produce state:
\[\hat{z}_{t+1}=F_\theta(z_t,a_t)\]
  • JEPA sits close to the simulator paradigm because it also predicts latent state rather than reconstructing pixels, but graph simulators are usually more explicitly structured: their state is object, particle, or mesh based, and their dynamics are organized around relations. This makes graph simulators highly interpretable and physically grounded, while JEPA-style latent simulators are often more scalable to raw sensory data and less dependent on manually specified state variables.

  • A mature world-model stack may combine these approaches: a perception system or JEPA encoder constructs latent state from observations, a graph simulator rolls forward structured dynamics, and a renderer decodes selected states into visual observations for inspection, training, or human interaction.

Simulator World Models: Evaluation, Interfaces, and Integration

Simulator Interfaces

  • A simulator world model should expose a state interface that can be queried, updated, and evaluated. Unlike renderer models, which produce observations, simulator models should preserve variables that matter for future prediction:

    \[z_t = \{x_t, v_t, m, r, c, \rho, \mathcal{G}_t\}\]
    • where \(x_t\) may denote positions, \(v_t\) velocities, \(m\) masses, \(r\) object relations, \(c\) contacts, \(\rho\) material parameters, and \(\mathcal{G}_t\) graph structure. The exact state depends on the domain: NeRF-style models expose continuous radiance fields, 3D Gaussian Splatting exposes explicit Gaussian primitives, graph simulators expose particles or objects, and mesh simulators expose discretized physical fields.
  • NeRF by Mildenhall et al. (2020) exposes a continuous 5D radiance field useful for view synthesis, while 3D Gaussian Splatting by Kerbl et al. (2023) exposes editable spatial primitives that make real-time rendering more practical. Learning to Simulate Complex Physics with Graph Networks by Sanchez-Gonzalez et al. (2020) exposes particle states as graph nodes and uses message passing to predict physical evolution.

State Accuracy and Rollout Accuracy

  • Simulator evaluation should distinguish one-step accuracy from rollout accuracy. One-step prediction measures whether the model can estimate the immediate next state:
\[\mathcal{L}_{\text{1-step}} =\left\| F_\theta(z_t,a_t)-z_{t+1} \right\|_2^2\]
  • Rollout accuracy measures whether the simulator remains stable under its own predictions:

    \[\mathcal{L}_{\text{rollout}} = \sum_{k=1}^{H} \left\| \hat{z}_{t+k}-z_{t+k} \right\|_2^2\]
    • where:

      \[\hat{z}_{t+k+1}=F_\theta(\hat{z}_{t+k},a_{t+k})\]
  • This distinction is crucial because a model can have low one-step error yet fail when rolled out for many steps. Learning to Simulate Complex Physics with Graph Networks by Sanchez-Gonzalez et al. (2020) emphasizes long-horizon rollout robustness and shows that training-time noise corruption helps the model recover from off-distribution prediction errors.

Physical Plausibility

  • A simulator should satisfy physical plausibility constraints whenever the domain has known structure. These constraints may include conservation of mass, bounded energy drift, collision consistency, material constraints, mesh validity, and contact stability. In learned simulators, these constraints can be enforced explicitly through the architecture, softly through regularization, or implicitly through training data.

  • A generic physically regularized loss is:

\[\mathcal{L} =\mathcal{L}_{\text{pred}} + \lambda_E \mathcal{L}_{\text{energy}} + \lambda_C \mathcal{L}_{\text{contact}} + \lambda_B \mathcal{L}_{\text{boundary}}\]

Editability and Counterfactual Validity

  • A simulator should support counterfactual changes. If the mass of an object changes, the model should predict a different trajectory. If a force is applied, the model should update the future state accordingly. If a camera viewpoint changes, a spatial simulator should render the same underlying scene from the new view.

  • Counterfactual validity can be expressed as:

\[z_t'=\text{Intervene}(z_t,\delta)\] \[\hat{z}_{t+1}'=F_\theta(z_t',a_t)\]
  • The simulator should respond consistently to \(\delta\). Learning to See Physics via Visual De-animation by Wu et al. (2017) is important here because it recovers physical world state from vision and then uses physics and graphics engines for prediction and reasoning, making counterfactual physical inference part of the simulator interface.
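  • The intervention interface can be sketched with an immutable state object. This is an illustrative sketch only: `State`, `intervene`, and the hand-written point-mass dynamics `F` are hypothetical stand-ins for a learned \(F_\theta\) and its state representation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    position: float
    velocity: float
    mass: float


def intervene(z, **delta):
    # Return an edited copy of the state; the factual state is left untouched.
    return replace(z, **delta)


def F(z, force, dt=0.1):
    # Hand-written point-mass dynamics standing in for a learned F_theta.
    accel = force / z.mass
    vel = z.velocity + dt * accel
    return replace(z, position=z.position + dt * vel, velocity=vel)


z = State(position=0.0, velocity=0.0, mass=1.0)
z_cf = intervene(z, mass=2.0)          # counterfactual: the object is heavier

factual = F(z, force=1.0)
counterfactual = F(z_cf, force=1.0)    # same action, different outcome
```

    • A useful check for a learned simulator is whether its response to such an edit has the right direction and rough magnitude; here doubling the mass halves the velocity gained under the same force.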

Rendering as a Diagnostic, Not the Whole Objective

  • Many simulator world models include a renderer:
\[\hat{o}_t=R_\psi(z_t)\]
  • Rendering is useful because observations provide supervision and allow humans to inspect predicted states. However, a good rendered image is not sufficient proof of a good simulator. The latent state may still be geometrically inconsistent, physically invalid, or unstable under intervention.

  • This is the key distinction from renderer world models. A renderer can be evaluated by visual quality, but a simulator must be evaluated by whether its internal state remains valid. DreamFusion by Poole et al. (2022) illustrates this boundary: a 2D diffusion prior supervises a 3D representation through rendered views, but the target object is a 3D state that can be viewed, relit, and composed into 3D environments.

Integration with Planners

  • Simulator world models become decision-relevant when a planner can use them to evaluate possible actions. Given a learned dynamics model \(\hat{z}_{t+1}=F_\theta(z_t,a_t)\), a planner can search for an action sequence that minimizes a goal-conditioned cost:
\[a_{t:t+H}^* =\arg\min_{a_{t:t+H}} \sum_{k=1}^{H} C(\hat{z}_{t+k},z_g)\]
  • The simulator need not render every candidate future. It only needs to provide a reliable state rollout and a cost-relevant representation. This is why simulator world models are often more efficient than renderer world models for control.

  • A thorough evaluation of simulator-planner integration should measure whether simulated trajectories preserve goal-relevant state, whether the planner’s selected actions transfer to the real or target environment, whether rollout errors compound under closed-loop replanning, and whether the simulator supports interventions outside the exact training distribution.

Relationship to JEPA

  • Simulator world models and JEPA world models are closely related because both prioritize state prediction over pixel reconstruction. The difference is the explicitness of the state. Graph simulators and mesh simulators represent state as objects, particles, relations, or mesh fields. JEPA represents state as learned embeddings:
\[z_t=f_\theta(o_t)\] \[\hat{z}_{t+1}=g_\phi(z_t,a_t)\]
  • This makes JEPA more scalable to unstructured sensory data, but often less inspectable than object-centric or mesh-based simulators. A strong world-model architecture may combine both approaches: JEPA-style encoders can learn compact predictive representations from raw video, while graph or mesh simulators can impose relational and physical structure where explicit state is available.

Planner World Models

Latent Imagination and Model-Based Control

  • Planner world models are systems whose primary output is action. A renderer predicts observations, a simulator predicts state, and a planner chooses interventions that are expected to achieve a goal. In learned world-model planning, the agent first learns a predictive model of the environment, then uses that model to evaluate candidate futures:

    \[a_{t:t+H}^{*} =\arg\max_{a_{t:t+H}} \mathbb{E} \left[ \sum_{k=0}^{H} \gamma^k r(\hat{z}_{t+k},a_{t+k}) \right]\]
    • where \(\hat{z}_{t+k}\) is a predicted latent state and \(H\) is the planning horizon. World Models by Ha and Schmidhuber (2018) established the neural world-model framing in which an agent learns compressed spatial and temporal representations, then trains a compact controller using those learned features.
  • The following figure (source) shows the World Models pipeline, which consists of three components that work closely together: Vision (V), Memory (M), and Controller (C). Visual observations are compressed by a VAE, temporal dynamics are modeled by an MDN-RNN, and a compact controller acts using the learned latent state.

Planning from Pixels with Latent Dynamics

  • The central challenge in planner world models is that raw observations are too high-dimensional for direct planning. A planner should not search over pixels; it should search over compact latent states that preserve reward-relevant dynamics.

  • Learning Latent Dynamics for Planning from Pixels by Hafner et al. (2019) introduced PlaNet, a model-based agent that learns a recurrent state-space model from images and chooses actions through online planning in latent space.

  • PlaNet uses a latent dynamics model with deterministic and stochastic components:

\[h_t=f_\theta(h_{t-1},z_{t-1},a_{t-1})\] \[z_t \sim p_\theta(z_t\mid h_t)\] \[\hat{o}_t \sim p_\theta(o_t\mid h_t,z_t)\]
  • The deterministic state \(h_t\) preserves recurrent memory, while the stochastic state \(z_t\) represents uncertainty and partial observability. Planning then uses model predictive control, commonly with the cross-entropy method, to sample candidate action sequences, roll them forward in latent space, score predicted rewards, and execute the first action before replanning.
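  • The split between the deterministic and stochastic paths can be sketched with scalar states. This is a toy illustration, not PlaNet's architecture: the fixed weights, the `tanh` cell standing in for a recurrent network, and the Gaussian parameterization are all assumptions, and the observation decoder is omitted.

```python
import math
import random


def rssm_step(h_prev, z_prev, a_prev, rng):
    # Deterministic path: h_t = f(h_{t-1}, z_{t-1}, a_{t-1}).
    # A fixed tanh cell stands in for the learned recurrent network.
    h = math.tanh(0.9 * h_prev + 0.5 * z_prev + 0.3 * a_prev)
    # Stochastic path: z_t ~ p(z_t | h_t), a Gaussian whose mean and
    # standard deviation depend on h_t (illustrative coefficients).
    mean, std = 0.8 * h, 0.1 + 0.1 * abs(h)
    z = rng.gauss(mean, std)
    return h, z


def imagine(h, z, actions, seed):
    # Roll the latent model forward without touching any observations.
    rng = random.Random(seed)
    trajectory = []
    for a in actions:
        h, z = rssm_step(h, z, a, rng)
        trajectory.append((h, z))
    return trajectory


actions = [1.0, 0.0, -1.0]
traj_a = imagine(0.0, 0.0, actions, seed=1)
traj_b = imagine(0.0, 0.0, actions, seed=2)
```

    • Both rollouts share the same first deterministic state, because it depends only on the previous state and action, but their sampled \(z_t\) differ, and that difference then propagates into later \(h_t\). Sampling several such rollouts is how a planner can account for uncertainty about the future.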

  • The following figure (source) shows PlaNet learning latent dynamics from image observations and using online planning in compact latent space to choose actions. Specifically, it shows the image-based control domains used in their experiments. The images show agent observations before downscaling to \(64 \times 64 \times 3\) pixels. (a) The cartpole swingup task has a fixed camera so the cart can move out of sight. (b) The reacher task has only a sparse reward. (c) The cheetah running task includes both contacts and a larger number of joints. (d) The finger spinning task includes contacts between the finger and the object. (e) The cup task has a sparse reward that is only given once the ball is caught. (f) The walker task requires balance and predicting difficult interactions with the ground when the robot is lying down.

  • A compact implementation pattern is:
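# Infer the current latent belief from the observation history.
belief = encoder.observe(history)
# Hypothetical helper: start from a broad distribution over action sequences.
distribution = initial_action_distribution(horizon)
for iteration in range(num_cem_iters):
    # Sample candidates, imagine them in latent space, and score predicted reward.
    action_sequences = sample_action_sequences(distribution)
    imagined_states = world_model.rollout(belief, action_sequences)
    returns = reward_model(imagined_states).sum(dim="time")
    # Refit the sampling distribution to the highest-return sequences.
    distribution = refit_to_elite_sequences(action_sequences, returns)
# Execute only the first action of the refined plan, then replan at the next step.
action = distribution.mean[0]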
  • The important design choice is that planning happens inside the learned latent model, not in observation space. This is why PlaNet belongs to the planner branch even though it learns a simulator internally: the learned simulator exists to support action selection.
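  • A self-contained version of the cross-entropy method loop is sketched below. It is a toy under stated assumptions: the latent state is a scalar, `rollout_return` hard-codes a hypothetical dynamics and reward model in place of learned networks, and the sample counts are arbitrary.

```python
import random
import statistics


def rollout_return(z, actions, goal):
    # Imagined return of one action sequence under a toy latent model.
    total = 0.0
    for a in actions:
        z = z + a                   # toy latent dynamics (assumed)
        total += -(z - goal) ** 2   # toy reward model (assumed)
    return total


def cem_plan(z0, goal, horizon=5, iters=8, num_samples=200, num_elite=20, seed=0):
    rng = random.Random(seed)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(num_samples)
        ]
        # Keep the sequences with the highest imagined return.
        candidates.sort(key=lambda acts: rollout_return(z0, acts, goal), reverse=True)
        elite = candidates[:num_elite]
        # Refit the per-step mean and spread to the elite set.
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elite)]
    return mean


plan = cem_plan(z0=0.0, goal=2.0)
first_action = plan[0]   # execute this action, then replan from the next belief
```

    • Only `first_action` is executed; the rest of the plan is discarded and recomputed after the next observation, which limits how far model error can carry the agent off course.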

Latent Imagination and Policy Learning

  • Online planning can be computationally expensive because it repeatedly samples and evaluates action sequences at decision time. Dreamer shifts the emphasis from online search to policy learning inside imagined latent trajectories.

  • Dream to Control: Learning Behaviors by Latent Imagination by Hafner et al. (2019) introduced Dreamer, which learns long-horizon behaviors by backpropagating value estimates through trajectories imagined in the compact latent state space of a learned world model.

  • Dreamer learns three coupled components:

\[\text{world model: } p_\theta(z_{t+1}\mid z_t,a_t)\] \[\text{actor: } a_t\sim \pi_\phi(a_t\mid z_t)\] \[\text{critic: } v_\psi(z_t)\approx \mathbb{E}\left[\sum_{k\geq0}\gamma^k r_{t+k}\right]\]
  • The actor is trained on imagined rollouts rather than only real environment transitions:

    \[\mathcal{L}_{\text{actor}} =-\mathbb{E} \left[ \sum_{t=1}^{H} V_\lambda(z_t) \right]\]
    • where \(V_\lambda\) is a bootstrapped return estimate computed from imagined rewards and critic values. Dream to Control by Hafner et al. (2019) is central because it shows that a planner world model can become a behavior-learning system: the model generates imagined futures, and the policy improves by differentiating through those futures.
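  • The bootstrapped return can be sketched with the standard recursive \(\lambda\)-return, which blends one-step critic targets with longer imagined returns. This is a generic formulation offered as an assumption about the shape of \(V_\lambda\); the paper's exact indexing and weighting may differ, and the function name and default values are illustrative.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns along one imagined trajectory.

    rewards: imagined rewards r_0 .. r_{H-1}.
    values:  critic estimates v(z_0) .. v(z_H), one more entry than rewards.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the critic at the horizon
    for t in reversed(range(len(rewards))):
        # Mix the critic's estimate with the longer return from step t + 1.
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * blended
        returns[t] = next_return
    return returns


# Hypothetical imagined trajectory with a valuable final state.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 10.0]
td_targets = lambda_returns(rewards, values, gamma=0.5, lam=0.0)
full_returns = lambda_returns(rewards, values, gamma=0.5, lam=1.0)
```

    • Setting `lam=0` recovers one-step targets that trust the critic everywhere, while `lam=1` uses the whole imagined reward sequence and bootstraps only at the horizon; intermediate values trade bias against variance.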
  • The following figure (source) shows Dreamer learning a world model from experience and learning behaviors by propagating value estimates through imagined latent trajectories.

Uncertainty-Aware Planning with Probabilistic Dynamics

  • Planning becomes risky when the learned model is uncertain. If a planner exploits model errors, it may choose actions that look good inside the model but fail in the real environment. Probabilistic dynamics models address this by representing uncertainty over transitions.

  • Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models by Chua et al. (2018) introduced PETS (probabilistic ensembles with trajectory sampling), which combines probabilistic neural network ensembles with trajectory sampling, enabling model predictive control that accounts for epistemic and aleatoric uncertainty.

  • A probabilistic ensemble dynamics model can be written as:

    \[p_\theta(s_{t+1}\mid s_t,a_t) =\frac{1}{M} \sum_{m=1}^{M} p_{\theta_m}(s_{t+1}\mid s_t,a_t)\]
    • where different ensemble members represent model uncertainty. Planning then evaluates action sequences under sampled futures rather than a single deterministic rollout:

      \[J(a_{t:t+H}) =\mathbb{E}_{p_\theta} \left[ \sum_{k=0}^{H} r(s_{t+k},a_{t+k}) \right]\]
  • This makes PETS a useful bridge between simulator and planner paradigms: the simulator is not only predictive, but also uncertainty-aware, and the planner uses that uncertainty when choosing actions.
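  • The ensemble objective can be sketched by propagating particles through randomly chosen ensemble members. This is a toy illustration rather than the PETS implementation: the three scalar "members", their gains, the noise level, and the reward are made-up assumptions, and a member is resampled at every step (keeping one member per particle is another common choice).

```python
import random


def make_member(gain, noise_std):
    def step(s, a, rng):
        # Aleatoric uncertainty: each member predicts a Gaussian next state.
        return rng.gauss(gain * s + a, noise_std)
    return step


# Epistemic uncertainty: members disagree about the gain.
ensemble = [make_member(gain, 0.05) for gain in (0.9, 1.0, 1.1)]


def expected_return(s0, actions, goal, num_particles=30, seed=0):
    # Monte Carlo estimate of J(a_{t:t+H}) under the ensemble mixture.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_particles):
        s = s0
        for a in actions:
            member = rng.choice(ensemble)   # sample one model of the world
            s = member(s, a, rng)
            total += -(s - goal) ** 2       # toy reward: stay near the goal
    return total / num_particles


good = expected_return(1.0, [1.0, 0.0, 0.0], goal=2.0)
bad = expected_return(1.0, [0.0, 0.0, 0.0], goal=2.0)
```

    • Averaging over particles penalizes plans whose outcome depends heavily on which ensemble member is right, which is one way a planner can avoid exploiting the errors of a single model.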

Why Planner World Models Matter

  • Planner world models are the point where world modeling becomes agency. A renderer can show possible futures, and a simulator can roll forward state, but a planner decides what to do. The planning objective converts prediction into intervention:
\[\text{prediction} \rightarrow \text{evaluation} \rightarrow \text{action}\]
  • A practical planning world model should therefore satisfy a demanding set of requirements: it should learn compact states that preserve reward-relevant information, predict futures accurately enough over the planner’s horizon, represent uncertainty when the future is ambiguous, avoid exploiting model errors, support efficient candidate-action evaluation, and improve policies using imagined experience rather than only real interaction.

Search, Task-Oriented Latent Models, and Scalable Control

Planning-Relevant Models Rather than Complete Simulators

  • A planning world model does not need to reconstruct every aspect of the environment. It needs to preserve the aspects of the future that change the ranking of candidate actions. This motivates a task-oriented objective:
\[z_t = e_\theta(o_{\leq t})\] \[\hat{z}_{t+1}=d_\theta(z_t,a_t)\] \[\hat{r}_t=r_\theta(z_t,a_t)\] \[\hat{v}_t=v_\theta(z_t)\]
  • The learned state \(z_t\) is valuable when it supports accurate reward, value, and action evaluation, even if it cannot reconstruct the original observation. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model by Schrittwieser et al. (2020) formalized this principle in MuZero: its model predicts reward, policy, and value quantities relevant to search, without requiring observation reconstruction or access to environment rules.

  • This distinction separates planning world models from generative simulators. A generative model is trained to explain observations:

\[p_\theta(o_{t+1}\mid o_{\leq t},a_t)\]
  • A planning model is trained to preserve decision-relevant quantities:
\[p_\theta(r_{t:t+H},v_{t:t+H},\pi_{t:t+H}\mid z_t,a_{t:t+H})\]
  • The second objective can be substantially easier because it ignores visual details, stochastic nuisance variables, and environmental structure that do not affect the agent’s decision. MuZero explicitly learns a hidden state that is free to represent whatever internal structure best supports accurate planning, rather than matching a true environment state or reconstructing observations.

MuZero and Search over Learned Latent States

  • MuZero combines a learned latent dynamics model with Monte Carlo Tree Search. It contains a representation function, a dynamics function, and a prediction function:

    \[s_t^0=h_\theta(o_{1:t})\] \[r_t^k,s_t^k=g_\theta(s_t^{k-1},a_t^k)\] \[p_t^k,v_t^k=f_\theta(s_t^k)\]
    • where \(h_\theta\) maps observation history into a latent root state, \(g_\theta\) predicts the next latent state and immediate reward under a hypothetical action, and \(f_\theta\) predicts policy logits and value. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model by Schrittwieser et al. (2020) shows that a learned planning model can support superhuman performance in Go, chess, and shogi while also achieving strong Atari results without being given game rules.
  • The planning loop evaluates a search tree using learned priors and values:

    \[a_t^*=\arg\max_a N(s_t,a)\]
    • where \(N(s_t,a)\) is the visit count assigned by search. A typical selection score combines action value, exploration pressure, and policy prior:

      \[U(s,a) =Q(s,a) + c_{\text{puct}} P(s,a) \frac{\sqrt{\sum_b N(s,b)}}{1+N(s,a)}\]
  • The central insight is that the model need not simulate observations. It only needs to generate latent transitions that let search compare action branches. This is particularly effective in discrete-action domains, where tree search can efficiently expand and revisit promising branches.
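
  • The selection score and the visit-count rule above can be sketched in a few lines. This is a minimal illustration, not MuZero's implementation: the function names and the constant `c_puct=1.25` are assumptions chosen for the example, and the per-action statistics are passed in as plain lists.

```python
import math

def puct_scores(q, prior, visits, c_puct=1.25):
    """Return U(s, a) for every action at one search node.

    q, prior, visits are equal-length lists indexed by action.
    c_puct is an illustrative constant (an assumption for this sketch).
    """
    total = sum(visits)
    return [
        q_a + c_puct * p_a * math.sqrt(total) / (1 + n_a)
        for q_a, p_a, n_a in zip(q, prior, visits)
    ]

def select_action(q, prior, visits, c_puct=1.25):
    """During search: descend into the child with the highest score."""
    scores = puct_scores(q, prior, visits, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)

def final_action(visits):
    """After search: act greedily on the root visit counts."""
    return max(range(len(visits)), key=visits.__getitem__)

# An unvisited action with the same prior gets a large exploration bonus,
# while the final decision still follows the visit counts.
print(select_action(q=[0.5, 0.4], prior=[0.5, 0.5], visits=[10, 0]))
print(final_action([10, 0]))
```

  • The two functions play different roles: the score steers which branch the search expands next, while the visit counts accumulated at the root determine the action that is actually executed.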

  • The following figure (source) shows MuZero’s learned planning model, where an observation is encoded into a hidden state, hypothetical actions are applied through recurrent latent dynamics, and each latent state predicts reward, policy, and value for tree search. Specifically, it shows planning, acting, and training with a learned model:

    • (A) How MuZero uses its model to plan. The model consists of three connected components for representation, dynamics and prediction. Given a previous hidden state \(s^{k-1}\) and a candidate action \(a^k\), the dynamics function \(g\) produces an immediate reward \(r^k\) and a new hidden state \(s^k\). The policy \(p^k\) and value function \(v^k\) are computed from the hidden state \(s^k\) by a prediction function \(f\). The initial hidden state \(s^0\) is obtained by passing the past observations (e.g. the Go board or Atari screen) into a representation function \(h\).
    • (B) How MuZero acts in the environment. A Monte-Carlo Tree Search is performed at each timestep \(t\), as described in A. An action \(a_{t+1}\) is sampled from the search policy \(\pi_t\), which is proportional to the visit count for each action from the root node. The environment receives the action and generates a new observation \(o_{t+1}\) and reward \(u_{t+1}\). At the end of the episode the trajectory data is stored into a replay buffer.
    • (C) How MuZero trains its model. A trajectory is sampled from the replay buffer. For the initial step, the representation function \(h\) receives as input the past observations \(o_1, \ldots, o_t\) from the selected trajectory. The model is subsequently unrolled recurrently for \(K\) steps. At each step \(k\), the dynamics function \(g\) receives as input the hidden state \(s^{k-1}\) from the previous step and the real action \(a_{t+k}\). The parameters of the representation, dynamics and prediction functions are jointly trained, end-to-end by backpropagation-through-time, to predict three quantities: the policy \(\mathbf{p}^k \approx \pi_{t+k}\), value function \(v^k \approx z_{t+k}\), and reward \(r_{t+k} \approx u_{t+k}\), where \(z_{t+k}\) is a sample return: either the final reward (board games) or \(n\)-step return (Atari).

Model Predictive Control with Task-Oriented Latent Dynamics

  • Continuous-control planning often cannot enumerate actions through tree search. Instead, it samples and refines candidate action trajectories. Temporal Difference Learning for Model Predictive Control by Hansen et al. (2022) introduced TD-MPC, which combines short-horizon trajectory optimization in a task-oriented latent model with a learned terminal value function.

  • TD-MPC learns an encoder, latent transition model, reward model, value function, and policy prior:

\[z_t=e_\theta(o_t)\] \[z_{t+1}=d_\theta(z_t,a_t)\] \[\hat{r}_t=R_\theta(z_t,a_t)\] \[\hat{Q}_t=Q_\theta(z_t,a_t)\]
  • The planner evaluates an action sequence using short-horizon model rollouts and a terminal value estimate:

    \[\phi(\Gamma) =\sum_{k=0}^{H-1} \gamma^k R_\theta(z_k,a_k) + \gamma^H Q_\theta(z_H,a_H)\]
    • where:

      \[z_{k+1}=d_\theta(z_k,a_k)\]
  • This hybrid objective is important because it reduces the need for very long model rollouts. The latent model handles local trajectory optimization, while the terminal value function estimates long-range consequences beyond the planning horizon. Temporal Difference Learning for Model Predictive Control by Hansen et al. (2022) argues that this task-oriented formulation avoids spending model capacity on irrelevant visual details while preserving the quantities required for continuous control.
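
  • To make the objective \(\phi(\Gamma)\) concrete, the sketch below scores one action sequence by rolling a latent model forward for \(H\) steps and adding a discounted terminal value. The scalar `toy_dynamics`, `toy_reward`, and `toy_q` functions are hypothetical stand-ins for the learned networks \(d_\theta\), \(R_\theta\), and \(Q_\theta\); they are assumptions for illustration only.

```python
def score_sequence(z0, actions, dynamics, reward, q_value, gamma=0.99):
    """phi(Gamma): discounted model rewards over the first H actions
    plus a discounted terminal value Q(z_H, a_H).

    `actions` holds H + 1 entries: a_0 .. a_{H-1} are rolled through the
    model, and a_H is only used by the terminal value.
    """
    horizon = len(actions) - 1
    z, total = z0, 0.0
    for k in range(horizon):
        total += (gamma ** k) * reward(z, actions[k])
        z = dynamics(z, actions[k])  # z_{k+1} = d(z_k, a_k)
    return total + (gamma ** horizon) * q_value(z, actions[horizon])

# Hypothetical scalar stand-ins for the learned networks.
def toy_dynamics(z, a):
    return z + a

def toy_reward(z, a):
    return -abs(z)          # reward for staying near zero

def toy_q(z, a):
    return -10.0 * abs(z)   # long-run cost of ending far from zero

good = score_sequence(1.0, [-1.0, 0.0, 0.0], toy_dynamics, toy_reward, toy_q, gamma=1.0)
bad = score_sequence(1.0, [0.0, 0.0, 0.0], toy_dynamics, toy_reward, toy_q, gamma=1.0)
print(good, bad)
```

  • In this toy setting the terminal value dominates the score of the sequence that never corrects its state, which mirrors how a learned critic lets a short rollout account for consequences beyond the horizon.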

  • The following figure (source) shows TD-MPC combining a task-oriented latent dynamics model, reward prediction, terminal value estimation, policy guidance, and model predictive trajectory optimization. (Top) A framework for MPC is presented using a task-oriented latent dynamics model and value function learned jointly by temporal difference learning. We perform trajectory optimization over model rollouts and use the value function for long-term return estimates. (Bottom) Episode return of our method, SAC, and MPC with a ground-truth simulator on challenging, high dimensional Humanoid and Dog tasks. Mean of 5 runs; shaded areas are 95% confidence intervals.

Sampling-Based Trajectory Optimization

  • TD-MPC uses Model Predictive Path Integral control to optimize continuous action sequences. The planner samples candidate trajectories from a time-indexed Gaussian distribution:
\[a_t^{(i)} \sim \mathcal{N} \left( \mu_t, \sigma_t^2 I \right)\]
  • Each sampled trajectory is rolled out in latent space and scored by predicted return. The mean and variance are then updated from high-scoring samples:

    \[\mu^{j} =\frac{ \sum_{i=1}^{K} \Omega_i \Gamma_i^{*} }{ \sum_{i=1}^{K} \Omega_i }\] \[\sigma^{j} =\sqrt{ \frac{ \sum_{i=1}^{K} \Omega_i \left( \Gamma_i^{*}-\mu^j \right)^2 }{ \sum_{i=1}^{K} \Omega_i } }\]
    • where \(\Omega_i\) weights elite trajectories according to predicted return. Temporal Difference Learning for Model Predictive Control by Hansen et al. (2022) uses this sampling-based planning procedure together with a learned policy prior, allowing planning to focus on locally promising trajectories rather than uniformly exploring the full continuous action space.
  • This planning mechanism illustrates an important distinction between MuZero and TD-MPC. MuZero searches discrete action trees using MCTS. TD-MPC searches continuous action sequences using trajectory sampling and distribution refinement. Both rely on learned latent models, but their planning algorithms match different action-space structures.
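
  • The weighted mean and standard deviation update can be sketched for a single action dimension as follows. This is a simplified illustration rather than the TD-MPC implementation: the elite count, temperature, sample count, and noise floor are assumed values, the policy prior is omitted, and the weights subtract the best return before exponentiation purely for numerical stability (which leaves the normalized weights unchanged).

```python
import math
import random

def mppi_update(samples, returns, num_elites=2, temperature=1.0):
    """One refinement step for a single action dimension.

    samples: candidate action values; returns: their predicted returns.
    Elites are weighted by exp(temperature * (return - best return)).
    """
    ranked = sorted(zip(returns, samples), reverse=True)[:num_elites]
    best = ranked[0][0]
    weights = [math.exp(temperature * (r - best)) for r, _ in ranked]
    norm = sum(weights)
    mu = sum(w * a for w, (_, a) in zip(weights, ranked)) / norm
    var = sum(w * (a - mu) ** 2 for w, (_, a) in zip(weights, ranked)) / norm
    return mu, math.sqrt(var)

def plan(score, mu=0.0, sigma=1.0, iterations=6, num_samples=64, seed=0):
    """Iteratively sample, score, and refit a Gaussian over one action."""
    rng = random.Random(seed)
    for _ in range(iterations):
        samples = [rng.gauss(mu, sigma) for _ in range(num_samples)]
        returns = [score(a) for a in samples]
        mu, sigma = mppi_update(samples, returns, num_elites=8)
        sigma = max(sigma, 1e-3)  # keep a small amount of exploration noise
    return mu

# Hypothetical scoring function whose best action is 2.0.
print(plan(lambda a: -(a - 2.0) ** 2))
```

  • In a full planner, `score` would be the latent rollout return \(\phi(\Gamma)\) and each sample would be an entire action sequence rather than a single scalar.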

Scaling Task-Oriented Planning with TD-MPC2

  • TD-MPC2: Scalable, Robust World Models for Continuous Control by Hansen et al. (2024) extends task-oriented latent planning toward multi-task and multi-domain control, using a single set of hyperparameters across diverse continuous-control tasks.

  • TD-MPC2 retains the basic structure of latent model predictive control but strengthens representation normalization, reward and value learning, policy priors, and multi-task conditioning. Its latent state uses SimNorm, a normalization mechanism that biases representations toward sparse, bounded structure:

\[z_t=\text{SimNorm}(e_\theta(o_t))\]
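
  • As a rough sketch of what such a normalization does, the function below applies a softmax independently to consecutive fixed-size groups of latent dimensions, so that each group lies on a simplex. The group size and temperature are assumed values for illustration, and the list-based implementation is not the TD-MPC2 code.

```python
import math

def simnorm(x, group_size=8, temperature=1.0):
    """Apply a softmax independently to consecutive groups of `x`."""
    if len(x) % group_size != 0:
        raise ValueError("latent size must be a multiple of group_size")
    out = []
    for start in range(0, len(x), group_size):
        group = [v / temperature for v in x[start:start + group_size]]
        peak = max(group)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in group]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out

z = simnorm([0.0, 0.0, 1.0, 3.0], group_size=2)
print(z)  # every entry is in [0, 1]; each pair sums to 1
```

  • Because every group sums to one, the latent state is bounded regardless of the encoder's raw output scale, and a low temperature pushes each group toward a sparse, nearly one-hot code.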
  • The model additionally uses task embeddings:

    \[z_t=e_\theta(o_t,\tau)\]
    • where \(\tau\) is a learned task representation that conditions the encoder, dynamics model, reward model, value functions, and policy. This enables a single agent to operate across tasks with different embodiments, observation spaces, and action spaces. TD-MPC2 by Hansen et al. (2024) reports scaling a single 317-million-parameter agent across 80 continuous-control tasks and evaluates a shared configuration across 104 tasks.
  • The following figure (source) shows the TD-MPC2 architecture. Observations \(s\) are encoded into their (normalized) latent representation \(z\). The model then recurrently predicts actions \(\hat{a}\), rewards \(\hat{r}\), and terminal values \(\hat{q}\), without decoding future observations.

  • TD-MPC2 also revises the reward and value objectives by using discretized regression in a transformed reward space, reducing sensitivity to task-dependent reward magnitudes. This matters for multi-task planning because raw reward scales can vary sharply across domains, destabilizing a shared model.
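
  • One common way to realize discretized regression in a transformed space is to compress the target with a symmetric logarithm and encode it as a soft two-hot vector over fixed bins. The sketch below assumes that recipe; the bin layout and helper names are hypothetical, and it should not be read as the exact TD-MPC2 objective.

```python
import math

def symlog(x):
    """Sign-preserving log compression: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y):
    """Inverse of symlog."""
    return math.copysign(math.expm1(abs(y)), y)

def two_hot(value, bins):
    """Spread `value` over its two neighbouring bin centres."""
    value = min(max(value, bins[0]), bins[-1])  # clamp to the bin range
    target = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            target[i] += 1.0 - w_hi
            target[i + 1] += w_hi
            break
    return target

def decode(probs, bins):
    """Expected bin centre, mapped back to the raw reward scale."""
    return symexp(sum(p * b for p, b in zip(probs, bins)))

bins = [-2.0, -1.0, 0.0, 1.0, 2.0]       # assumed bin centres in symlog space
target = two_hot(symlog(1.0), bins)      # training target for a reward of 1.0
print(target, decode(target, bins))
```

  • A classification loss against `target` then replaces squared-error regression, so a reward of 1 and a reward of 1000 produce targets of comparable magnitude.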

Discrete Latent Planning and DreamerV2

  • Mastering Atari with Discrete World Models by Hafner et al. (2021) introduced DreamerV2, which uses discrete stochastic latent variables to improve world-model learning and trains behavior entirely through imagined trajectories.

  • DreamerV2 represents latent state with categorical variables:

    \[z_t =\left[ z_t^{(1)}, z_t^{(2)}, \dots, z_t^{(K)} \right]\]
    • where each component is sampled from a categorical distribution. Discrete latent variables can make the model less sensitive to small continuous-state drift and improve its ability to represent multimodal futures. Mastering Atari with Discrete World Models by Hafner et al. (2021) reports that discrete latents and KL balancing are important contributors to its Atari performance, while the learned world model supports policy optimization entirely in imagined experience.
  • The planning significance is that DreamerV2 trades explicit online search for extensive latent policy optimization. It learns a world model, imagines many trajectories in parallel, updates actor and critic networks from those trajectories, and executes the learned policy in the environment. This makes it especially useful when decision latency must remain low at inference time.

Planning Architecture Trade-offs

  • Planner world models can be organized by how they allocate computation. MuZero spends substantial test-time computation on search, making it suitable for discrete domains with deep combinatorial structure. TD-MPC spends test-time computation on short-horizon continuous trajectory optimization, making it suitable for continuous control and receding-horizon decision making. Dreamer-style agents spend more computation during training to improve an amortized policy, making inference efficient. PETS spends computation on uncertainty-aware trajectory sampling, making it useful when model uncertainty is central.

  • A practical architecture choice should consider the action space, planning horizon, environment stochasticity, model uncertainty, inference latency, and whether task-relevant value functions can compensate for limited rollout horizons. The key design principle remains:

\[\text{model complexity} + \text{planning computation} + \text{value estimation} =\text{decision quality}\]

Evaluation and Failure Modes

Planning Evaluation Should Match the Decision Loop

  • A planner world model should be evaluated by the quality of the actions it produces, not only by the accuracy of its predictions. A learned model can have plausible rollouts but still produce poor actions if its errors affect reward-relevant variables, if its value estimates are miscalibrated, or if the planner exploits model inaccuracies.

  • The relevant objective is closed-loop return:

\[J(\pi)= \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]\]
  • rather than only one-step prediction loss:
\[\mathcal{L}_{\text{pred}} =\left\| \hat{z}_{t+1}-z_{t+1} \right\|_2^2\]

Model Exploitation and Reward Misgeneralization

  • A planner searches for actions that maximize predicted value. If the world model is wrong, the planner may exploit the error. This is model exploitation:
\[a^*= \arg\max_a \hat{Q}_\theta(z,a) \quad \text{while} \quad Q(z,a)\ll \hat{Q}_\theta(z,a)\]
  • The risk is greatest when the planner evaluates out-of-distribution action sequences, when rollouts are long, or when the value model extrapolates beyond its training support. PETS: Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models by Chua et al. (2018) addresses this by using probabilistic ensembles and trajectory sampling, so the planner reasons over uncertainty rather than trusting a single deterministic model.

  • A thorough evaluation protocol should therefore measure policy return, calibration of uncertainty, sensitivity to out-of-distribution actions, degradation under longer planning horizons, and the gap between predicted and realized returns.

Planning Horizon and Compounding Error

  • Longer planning horizons can improve foresight, but they also amplify model error. If the model’s one-step transition error is at most \(\epsilon\), then the rollout error after \(H\) steps is bounded approximately by:

    \[\left\| \hat{z}_{t+H}-z_{t+H} \right\| \leq \sum_{k=0}^{H-1} L^k\epsilon\]
    • where \(L\) is an effective Lipschitz constant of the learned dynamics. When \(L>1\), the bound grows geometrically with the horizon \(H\); when \(L<1\), it stays below \(\epsilon/(1-L)\).
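
  • A quick numerical check makes the role of \(L\) visible. The helper below simply evaluates the sum in the bound; the specific values of \(L\), \(\epsilon\), and \(H\) are arbitrary choices for illustration.

```python
def rollout_error_bound(lipschitz, step_error, horizon):
    """Evaluate sum_{k=0}^{H-1} L^k * eps term by term."""
    return sum(step_error * lipschitz ** k for k in range(horizon))

# Same per-step error, three different dynamics regimes.
for lipschitz in (0.5, 1.0, 2.0):
    bounds = [rollout_error_bound(lipschitz, 0.1, h) for h in (1, 5, 10)]
    print(lipschitz, bounds)
```

  • With a contractive model the bound saturates, with \(L=1\) it grows linearly in the horizon, and with an expansive model it roughly doubles at every additional step in this example.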
  • This is why many successful planner world models combine short rollouts with terminal values. Temporal Difference Learning for Model Predictive Control by Hansen et al. (2022) uses short-horizon latent planning with a terminal value function, so the learned model handles local control while the critic estimates long-term return.

  • The following figure (source) shows how TD-MPC performance varies with planning horizon and the number of planning iterations, illustrating the compute-performance trade-off in model predictive control. Specifically, it shows the return of TD-MPC under a variable computational budget on four other tasks from DMControl: Quadruped Run (\(\mathcal{A} \in \mathbb{R}^{12}\)), Fish Swim (\(\mathcal{A} \in \mathbb{R}^5\)), Reacher Hard \(\left(\mathcal{A} \in \mathbb{R}^2\right)\), and Cartpole Swingup Sparse \((\mathcal{A} \in \mathbb{R})\). We evaluate performance of fully trained agents when varying (blue) planning horizon; (green) number of iterations during planning. For completeness, we also include evaluation of the jointly learned policy \(\pi_\theta\), as well as the default setting of 6 iterations and a horizon of 5 used during training. Higher values require more compute. Mean of 5 runs.

Online Search versus Amortized Planning

  • Planner world models allocate computation in different places. Online search methods spend computation at decision time; amortized methods spend more computation during training so that inference is fast.

  • MuZero performs online tree search over a learned latent model:

\[a_t=\text{MCTS}(h_\theta(o_{\leq t}))\]
  • Dreamer trains an actor to amortize planning into a policy:
\[a_t\sim\pi_\phi(a_t\mid z_t)\]
  • TD-MPC combines both by using a learned policy prior to guide test-time trajectory optimization:
\[a_{t:t+H}^{*} =\text{MPC}(z_t,\pi_\phi,Q_\theta,R_\theta,d_\theta)\]

Data Efficiency and Imagination Efficiency

  • Planner world models improve sample efficiency by reusing experience through imagined futures. A single real transition can train many imagined rollouts:
\[(o_t,a_t,r_t,o_{t+1}) \rightarrow \left\{\hat{z}_{t+1:t+H}^{(i)}\right\}_{i=1}^{N}\]
  • Dreamer-style models are particularly efficient because thousands of latent rollouts can be generated in parallel. Dream to Control by Hafner et al. (2019) emphasizes that compact latent states reduce memory and compute, enabling large numbers of imagined trajectories during training.

  • However, imagination efficiency only helps if imagined trajectories remain useful. If the model predicts wrong rewards or loses state information, additional imagination can amplify bias rather than improve behavior.

Evaluation Criteria for Planner World Models

  • Planner world models should be evaluated against several complementary criteria:

    • Closed-loop return: the realized task performance of the actions produced by the model.
    • Data efficiency: how many real environment interactions are needed to achieve competence.
    • Model exploitation resistance: whether the planner avoids actions that are only good under model error.
    • Compute efficiency: the cost of search, rollouts, value estimation, and policy updates.
    • Robustness: whether performance holds under domain shift, stochasticity, and partial observability.
    • Transfer: whether learned dynamics, values, or task embeddings help on unseen tasks.

  • TD-MPC2 by Hansen et al. (2024) is especially relevant for transfer and scale because it trains multi-task world models across multiple domains, embodiments, and action spaces using learned task embeddings and shared hyperparameters.

Integration with Renderer and Simulator World Models

  • Planner world models need not be visually generative, but they benefit from renderers and simulators in different ways. A renderer can provide synthetic observations, human-interpretable rollouts, and visual debugging. A simulator can provide compact state transitions, counterfactual rollouts, and physically meaningful variables. A planner consumes either rendered observations or simulator states to select actions.

  • A combined stack can be written as:

\[o_t \xrightarrow{\text{encoder}} z_t \xrightarrow{\text{simulator}} \hat{z}_{t+1:t+H} \xrightarrow{\text{planner}} a_t\]
  • and optionally:
\[\hat{z}_{t+k} \xrightarrow{\text{renderer}} \hat{o}_{t+k}\]
  • This is the natural architecture for a unified world model: renderers make futures visible, simulators make futures computable, and planners make futures actionable.

Relationship to JEPA

  • JEPA connects directly to planner world models because it learns latent predictive states without reconstructing observations. A JEPA-style planner can use:
\[z_t=f_\theta(o_t)\] \[\hat{z}_{t+1}=g_\phi(z_t,a_t)\] \[a_t^*= \arg\min_a d(\hat{z}_{t+1},z_g)\]
  • This resembles task-oriented planning models such as TD-MPC, but JEPA emphasizes self-supervised latent prediction and collapse avoidance rather than reward-supervised task representation. A strong future planner could combine JEPA pretraining for scalable latent dynamics, task-oriented value learning for decision relevance, and MPC or search for action selection.
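
  • The goal-reaching rule above can be sketched as a one-step search over a small set of candidate actions. The `toy_predictor`, which treats an action as a displacement in a two-dimensional latent space, is a hypothetical stand-in for a learned action-conditioned predictor \(g_\phi\).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def plan_one_step(z, z_goal, candidate_actions, predictor):
    """Return the action whose predicted next latent is closest to the goal."""
    return min(
        candidate_actions,
        key=lambda a: squared_distance(predictor(z, a), z_goal),
    )

# Hypothetical predictor: the action is a displacement in a 2-D latent space.
def toy_predictor(z, a):
    return [z[0] + a[0], z[1] + a[1]]

actions = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
print(plan_one_step([0.0, 0.0], [1.0, 0.0], actions, toy_predictor))
```

  • Extending this to multi-step planning means rolling the predictor forward over action sequences and scoring the final latent, which is where sampling-based optimizers become useful.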

  • This completes the planning branch of the primer. The next section returns to Joint-Embedding Predictive Architectures as the latent predictive paradigm that can connect representation learning, simulation, and planning.

Joint-Embedding Predictive Architectures

Overview

  • Joint-Embedding Predictive Architectures (JEPAs) provide a general framework for learning predictive representations by aligning latent embeddings of related signals rather than reconstructing observations. They represent a shift from reconstruction to prediction, from observation space to representation space, and from static inputs to structured predictive tasks. This shift enables models to focus on predictable, semantically meaningful structure while discarding high-entropy, task-irrelevant details, making JEPA a natural foundation for scalable world modeling.

  • In the renderer-simulator-planner taxonomy, JEPA is most naturally a simulator-oriented latent world model: it predicts hidden state structure rather than directly rendering pixels. When the predictor is action-conditioned and paired with a goal objective, JEPA also becomes a substrate for planners. This makes JEPA complementary to renderer-first systems: renderers prioritize what the world should look like, while JEPA-style models prioritize which latent aspects of the world are predictable and useful for downstream reasoning or control. A Functional Taxonomy of World Models clarifies this distinction by separating world models according to whether they output observations, states, or actions.

Core Principle: Prediction in Representation Space

  • At the core of JEPA is the idea that learning should focus on predicting semantically meaningful aspects of a signal rather than reconstructing the signal itself. Given two compatible signals \(x\) and \(y\), for example, two spatial regions of an image or two time steps in a video, JEPA learns:

    \[s_x = f_\theta(x), \qquad s_y = f_{\bar{\theta}}(y)\] \[\hat{s}_y = g_\phi(s_x, \xi)\]
    • where \(\xi\) encodes auxiliary information such as spatial position, masking indices, temporal offsets, or actions.
  • The objective is to align predicted and target embeddings:

\[\mathcal{L}_{\text{JEPA}} = \mathbb{E}\left[\left\| \hat{s}_y - s_y \right\|_2^2 \right]\]
  • This formulation replaces pixel-level reconstruction with latent prediction, thereby focusing learning on predictable structure. In functional terms, the target is closer to simulated state than rendered observation: the model is trained to predict what should be true in representation space, not necessarily what every pixel should be.
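
  • A minimal numerical sketch of this objective for a single pair \((x, y)\) is shown below. The encoders and predictor are plain linear maps with hypothetical weights, the conditioning variable \(\xi\) is omitted, and no gradients are computed, so this only illustrates how the loss is assembled.

```python
def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def jepa_loss(x, y, context_w, target_w, predictor_w):
    """Squared L2 distance between predicted and target embeddings."""
    s_x = linear(context_w, x)          # context encoder f_theta
    s_y = linear(target_w, y)           # target encoder (held fixed for this loss)
    s_y_hat = linear(predictor_w, s_x)  # predictor g_phi, conditioning omitted
    return sum((p - t) ** 2 for p, t in zip(s_y_hat, s_y))

identity = [[1.0, 0.0], [0.0, 1.0]]
print(jepa_loss([1.0, 2.0], [1.0, 2.0], identity, identity, identity))
print(jepa_loss([1.0, 2.0], [0.0, 0.0], identity, identity, identity))
```

  • The error is measured between embeddings, so nothing in the loss requires the model to account for details of \(y\) that the target encoder has already discarded.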

Architectural Components

  • A JEPA system typically consists of three primary modules:

    • Context encoder \(f_\theta\): encodes visible or conditioning inputs \(x\).
    • Target encoder \(f_{\bar{\theta}}\): encodes target inputs \(y\), often using an exponential moving average (EMA) of the context encoder.
    • Predictor \(g_\phi\): maps context representations to predicted target representations.
  • The following figure (source) shows a comparison between joint-embedding, generative, and joint-embedding predictive architectures, illustrating how JEPA predicts embeddings rather than reconstructing signals.
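
  • The EMA relationship between the two encoders amounts to a one-line parameter update. The sketch below treats parameters as flat lists of floats, and the momentum value is an illustrative assumption rather than a setting taken from a specific paper.

```python
def ema_update(target_params, context_params, momentum=0.996):
    """Move each target parameter a small step toward the context encoder."""
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_params, context_params)
    ]

target = [0.0, 0.0]
context = [1.0, -1.0]
for _ in range(3):
    target = ema_update(target, context, momentum=0.9)
print(target)  # drifts toward the context parameters without jumping to them
```

  • Because the target encoder only trails the context encoder, the prediction targets change slowly between steps, which is one of the asymmetries that discourages collapse.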

Masking and Target Selection

  • A defining feature of JEPA is the masking strategy used to construct prediction tasks. Instead of predicting the entire input, JEPA selects target regions and conditions on complementary context.

  • For image-based JEPA:

    • Large target blocks encourage semantic prediction.
    • Spatially distributed context preserves enough global information for inference.
    • Multiple target regions increase coverage and reduce overfitting to a single local relation.

  • This ensures that prediction cannot be solved using trivial local correlations. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture by Assran et al. (2023) emphasizes that target blocks must be sufficiently large and context blocks sufficiently informative to produce semantic representations.

  • Formally, let \(M\) denote a masking operator:
\[x = M(o), \qquad y = (1 - M)(o)\]
  • The model learns to predict the latent representation of \(y\) given \(x\).
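
  • The complementary split \(x = M(o)\), \(y = (1-M)(o)\) can be sketched over a grid of patch indices. This simplified version selects a single rectangular target block and uses every remaining patch as context; the grid and block sizes are assumptions for illustration, and practical systems typically sample several target blocks.

```python
import random

def split_patches(grid_h, grid_w, block_h, block_w, seed=0):
    """Pick one rectangular target block; everything else is context.

    Returns (context_indices, target_indices) over a row-major patch grid.
    """
    rng = random.Random(seed)
    top = rng.randint(0, grid_h - block_h)
    left = rng.randint(0, grid_w - block_w)
    target = {
        r * grid_w + c
        for r in range(top, top + block_h)
        for c in range(left, left + block_w)
    }
    context = [i for i in range(grid_h * grid_w) if i not in target]
    return context, sorted(target)

context, target = split_patches(grid_h=4, grid_w=4, block_h=2, block_w=2)
print(len(context), len(target))
```

  • Keeping the target as a contiguous block, rather than scattered patches, is what prevents the model from filling it in by interpolating from immediate neighbours.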

Comparison with Other Self-Supervised Objectives

  • JEPA differs from two dominant paradigms:

    • Contrastive learning: enforces similarity between augmented views but relies on negative samples.
    • Masked reconstruction: predicts missing pixels or tokens directly.
  • JEPA removes both the need for negative samples and the burden of pixel-level reconstruction. It instead enforces predictive consistency in latent space.

  • The distinction can be summarized as:

    • Contrastive: \(\text{maximize } \text{sim}(f(x), f(x^+))\)
    • Generative: \(\text{minimize } \| \hat{o} - o \|\)
    • JEPA: \(\text{minimize } \| \hat{s}_y - s_y \|\)
  • This distinction maps cleanly onto the functional taxonomy. Generative reconstruction is renderer-like because it optimizes observation fidelity; contrastive learning is representation-oriented but often not explicitly predictive; JEPA is simulator-like because it learns a compact predictive state space.

JEPA, Renderers, Simulators, and Planners

  • JEPA can be positioned precisely within the three functional world-model roles:

    • As a renderer alternative: JEPA avoids direct observation generation and therefore does not need to model every high-frequency visual detail.
    • As a simulator: JEPA predicts latent state transitions or masked latent states, making it well-suited to compact dynamics modeling.
    • As a planner substrate: action-conditioned JEPA can roll forward candidate latent trajectories and score them against a goal.
  • This is important because many systems called world models differ mainly in what they output. A video generator can be a world renderer, a physics engine can be a world simulator, and a policy model can be a world planner. JEPA is most naturally a simulator-style model that can become planner-capable when attached to an action-conditioned predictor and a planning objective. A Functional Taxonomy of World Models makes this distinction explicit by organizing world models by function rather than by architecture.

Avoiding Representation Collapse

  • A central challenge in JEPA training is collapse, where the model maps all inputs to a constant embedding:
\[s_x = c, \quad \forall x\]
  • This trivially minimizes the prediction loss if \(\hat{s}_y = c\).

  • JEPA avoids collapse through several mechanisms:

    • Architectural asymmetry between context and target encoders
    • Stop-gradient or EMA updates for the target encoder
    • Masking strategies that enforce non-trivial prediction tasks
  • More recent approaches introduce explicit regularization. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels by Maes et al. (2026) enforces Gaussian-distributed latent embeddings using a statistical regularizer, ensuring diversity and preventing collapse.
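
  • Collapse is easy to detect empirically by checking how much embeddings vary across a batch. The diagnostic below is a generic sketch and is not the regularizer used by any of the methods above; the tolerance is an assumed value.

```python
import statistics

def embedding_spread(embeddings):
    """Per-dimension standard deviation across a batch of embeddings."""
    dims = list(zip(*embeddings))
    return [statistics.pstdev(d) for d in dims]

def looks_collapsed(embeddings, tol=1e-6):
    """True when every dimension is (nearly) constant across the batch."""
    return all(s < tol for s in embedding_spread(embeddings))

print(looks_collapsed([[1.0, 2.0], [1.0, 2.0]]))  # identical embeddings
print(looks_collapsed([[1.0, 2.0], [0.0, 2.0]]))  # one dimension still varies
```

  • Monitoring this spread during training gives an early warning, since the prediction loss alone looks excellent when every input maps to the same constant.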

Temporal and Sequential Extensions

  • JEPA naturally extends to sequential data by treating future states as prediction targets:
\[\hat{z}_{t+1} = g_\phi(z_t, a_t)\]

Multimodal and Cross-Domain Extensions

From Representation Learning to World Modeling

  • JEPA becomes a world model when:

    • The inputs \(x\) and \(y\) correspond to temporally related observations
    • The predictor incorporates action or temporal information
    • The latent space supports planning or reasoning tasks
  • In this setting, JEPA learns a predictive latent space \(z_{t+1} \approx g_\phi(z_t, a_t)\) without reconstructing observations. This makes it computationally efficient and aligned with downstream control objectives.

  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning by Assran et al. (2025) demonstrates that such representations can scale to internet video and support planning after limited action-conditioned training.

  • From the functional-taxonomy perspective, this is the point where JEPA moves from representation learning into simulator and planner territory: latent predictions provide the simulated future, and action-conditioned rollouts provide the substrate for choosing interventions.

  • The next section focuses specifically on I-JEPA, detailing its design choices, masking strategy, and implementation at scale.

Image-Based Joint-Embedding Predictive Architecture (I-JEPA)

  • I-JEPA represents the first large-scale instantiation of the JEPA framework for visual representation learning. It is designed to learn high-level semantic features from images by predicting latent representations of masked regions using visible context, without relying on handcrafted augmentations or pixel-level reconstruction.

  • In the functional taxonomy of world models, I-JEPA is not yet a complete embodied world model because it does not model actions or temporal dynamics. It is best understood as a representation-learning substrate for simulator-style world models: it learns compact latent structure that later video, object-centric, or action-conditioned systems can use for prediction and planning. A Functional Taxonomy of World Models is useful here because it separates latent state modeling from rendering and planning, clarifying why an image-only model can still be foundational for later world-model systems.

Design Motivation

  • Prior self-supervised approaches in vision fall into two categories:

    • Invariance-based methods such as contrastive learning, which rely on augmentations to enforce representation similarity.
    • Generative methods such as masked autoencoders, which reconstruct missing pixels or tokens.
  • I-JEPA addresses limitations of both. It avoids augmentation-induced biases and does not require reconstructing high-frequency image details. Instead, it focuses on predicting only the predictable and semantically meaningful aspects of the image. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture by Assran et al. (2023) shows that representation-space prediction can produce semantic visual features without handcrafted view augmentations.

  • This design aligns with the hypothesis that intelligent systems learn by predicting the outcomes of partial observations rather than reconstructing full sensory inputs.

Architecture

  • I-JEPA consists of three main components:

    • Context encoder \(f_\theta\): processes visible image regions.
    • Target encoder \(f_{\bar{\theta}}\): processes masked target regions, typically updated via exponential moving average (EMA).
    • Predictor \(g_\phi\): maps context representations to predicted target representations.
  • The model operates entirely in latent space. Given an image \(o\), a masking strategy partitions it into:

    • Context blocks \(x\)
    • Target blocks \(y\)
  • The encoders produce embeddings:

\[s_x = f_\theta(x), \qquad s_y = f_{\bar{\theta}}(y)\]
  • The predictor then produces:

    \[\hat{s}_y = g_\phi(s_x, \text{pos}(y))\]
    • where positional embeddings encode spatial relationships.
  • The training objective is:

    \[\mathcal{L} = \sum_{y \in \mathcal{T}} \left\| \hat{s}_y - s_y \right\|_2^2\]
    • where \(\mathcal{T}\) is the set of target regions.

Masking Strategy

  • A key innovation in I-JEPA is its masking design. Unlike random patch masking used in reconstruction-based methods, I-JEPA uses:

    • Large target blocks: to ensure predictions require semantic understanding.
    • Spatially distributed context: to provide sufficient information for prediction.
  • This design prevents trivial solutions based on local pixel continuity and forces the model to capture global structure. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture by Assran et al. (2023) identifies target scale and context informativeness as central design choices for semantic representations.

  • The following figure (source) shows the I-JEPA architecture, where a context encoder processes visible patches, a target encoder produces representations of target blocks, and a predictor aligns predicted target embeddings with target-encoder embeddings. The I-JEPA architecture relies on the separation between context and target regions and the latent prediction mechanism. Specifically, it shows how the context encoder predicts embeddings of target regions using spatially distributed visible patches.

Training Dynamics

  • I-JEPA training relies on several mechanisms to ensure stability and scalability:

    • EMA target encoder: provides slowly evolving targets, reducing training instability.
    • No reconstruction decoder: reduces computational cost and avoids modeling irrelevant details.
    • Latent prediction loss: focuses on semantic consistency.
  • Importantly, the model does not require:

    • Negative samples, as in contrastive learning
    • Pixel-level losses, as in generative modeling
    • Strong data augmentations
  • This simplicity contributes to scalability.

I-JEPA as a State Abstraction Rather Than a Renderer

  • I-JEPA does not attempt to render missing pixels. This is its main distinction from masked autoencoders and diffusion-style visual reconstruction systems. In the renderer-simulator-planner taxonomy, reconstruction-based image models are closer to renderer models because their objective is observation fidelity. I-JEPA instead learns a compact visual state abstraction that is useful for later prediction.

  • This matters for world modeling because a simulator does not need to reproduce every sensory detail. It needs a state representation that supports stable prediction. I-JEPA supplies the image-level version of that idea: predict the latent content of missing regions, not their exact pixel realization.

Representation Properties

  • The representations learned by I-JEPA exhibit several desirable properties:

    • Semantic abstraction: captures object-level and scene-level information.
    • Robustness: less sensitive to low-level variations.
    • Transferability: performs well across downstream tasks such as classification, detection, and depth estimation.
  • Empirically, I-JEPA achieves strong performance on ImageNet linear evaluation while requiring less compute than competing methods. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture by Assran et al. (2023) reports strong downstream performance across classification, object counting, and depth prediction while avoiding view augmentations during pretraining.

Comparison with Masked Autoencoders

  • Masked autoencoders (MAE) reconstruct pixel values:
\[\mathcal{L}_{\text{MAE}} = \left\| \hat{o} - o \right\|^2\]
  • In contrast, I-JEPA predicts latent representations:
\[\mathcal{L}_{\text{I-JEPA}} = \left\| \hat{s}_y - s_y \right\|^2\]
  • This difference has important implications:

    • MAE must model high-frequency details.
    • I-JEPA focuses on predictable structure.
    • I-JEPA can ignore noise and stochastic variations.
  • From the functional-taxonomy perspective, MAE is more renderer-adjacent because it reconstructs observations, while I-JEPA is more simulator-adjacent because it trains a latent state representation without requiring observation-level output.

Scaling Behavior

  • I-JEPA scales effectively with model size and data. Using Vision Transformers, it can be trained efficiently on large datasets:

    • Large models converge with fewer epochs.
    • Performance improves with larger context and target regions.
    • Training remains stable due to EMA targets.
  • The method demonstrates that latent prediction is a viable alternative to both contrastive and generative objectives at scale.

Limitations

  • Despite its advantages, I-JEPA has several limitations:

    • No explicit temporal modeling: operates on single images.
    • Limited interaction reasoning: patch-based masking does not enforce object-level dynamics.
    • Deterministic predictions: does not explicitly model uncertainty.
  • These limitations motivate extensions to video, sequential modeling, and probabilistic formulations.

  • I-JEPA also lacks the full functional breadth of a world model. It is not a renderer because it does not generate observations, it is not a planner because it does not output actions, and it is only a partial simulator because it models static latent compatibility rather than temporal state transitions. Its importance lies in providing the representation-space prediction principle that later world-model systems extend.

Transition to Video and World Modeling

  • To become a full world model, JEPA must incorporate temporal structure and action conditioning. This leads to video-based extensions where the model predicts future latent states from past observations.

  • The next section examines V-JEPA and V-JEPA 2, which extend I-JEPA to video and enable understanding, prediction, and planning in dynamic environments.

Video JEPA and Scalable World Modeling

  • Extending JEPA from images to video transforms a static representation learner into a temporal predictive system. Video-based JEPA models learn to capture dynamics, motion, and temporal structure directly from sequences of observations, enabling the emergence of world modeling capabilities.

  • In the functional taxonomy of world models, video JEPA occupies an important middle ground. It is not a renderer-first model because it does not train by generating full video frames. It is not yet a complete planner unless actions and goals are introduced. Its core role is simulator-like: it learns latent state evolution from video, preserving the predictable structure needed for downstream understanding and control. A Functional Taxonomy of World Models separates renderers, simulators, and planners by output type, which clarifies why video generation and video latent prediction should not be treated as identical forms of world modeling.

From Spatial to Spatiotemporal Prediction

  • In I-JEPA, the prediction task is spatial: masked regions of an image are predicted from visible context. In video, this generalizes to spatiotemporal prediction:

    \[\hat{s}_{t+\Delta} = g_\phi(s_{\le t}, \xi)\]
    • where \(\Delta\) denotes a future time offset and \(\xi\) encodes temporal position or masking structure.
  • Instead of predicting pixels across time, video JEPA predicts latent embeddings of masked spatiotemporal regions. This allows the model to focus on predictable dynamics such as motion trajectories and object interactions, rather than reconstructing full video frames.

Video Renderers versus Video Latent Simulators

  • A video renderer models what future observations should look like:

    \[\hat{o}_{t+1:t+H} \sim p_\theta(o_{t+1:t+H}\mid o_{\le t},c)\]
    • where \(c\) may include text, camera motion, user input, or previous frames. This paradigm is useful for visual creation and imagination, but it can optimize visual plausibility without enforcing physically valid state transitions.
  • A video latent simulator models how the underlying representation evolves:

\[\hat{z}_{t+1:t+H} \sim p_\theta(z_{t+1:t+H}\mid z_{\le t})\]
  • This distinction matters because video generation can produce plausible images while failing to preserve object identity, geometry, contact, or causal consistency. Video JEPA is explicitly closer to the simulator side because it learns latent dynamics rather than frame synthesis. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning by Assran et al. (2025) frames this distinction directly by using representation-space prediction instead of video generation for scalable world modeling.

V-JEPA: Learning Dynamics from Video

  • Video JEPA models operate by masking portions of video clips and predicting their latent representations. The key design principles remain consistent:

    • Prediction in latent space
    • Masking-based task construction
    • Separation of context and target encoders
  • However, temporal structure introduces new challenges:

    • Capturing motion and temporal dependencies
    • Maintaining consistency across frames
    • Avoiding trivial interpolation solutions
  • These are addressed through structured masking and temporal encoding.

V-JEPA 2: Scaling to Internet Video

  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning by Assran et al. (2025) demonstrates that JEPA can scale to internet-scale video and serve as a foundation for world modeling. The paper reports pretraining on a video and image dataset comprising more than one million hours of internet video, followed by action-conditioned post-training on a smaller amount of robot interaction data.

  • The model is trained using a mask-denoising latent prediction objective. Unlike generative video models, it does not attempt to synthesize frames. Instead, it learns representations that capture:

    • Motion dynamics
    • Object behavior
    • Temporal dependencies
    • Appearance and action-relevant structure

Representation Learning at Scale

  • The scale of training data fundamentally changes the capabilities of the learned representation. V-JEPA 2 demonstrates that large-scale self-supervised video training yields representations that support:

    • Action recognition: identifying activities from video
    • Action anticipation: predicting future actions
    • Video question answering: reasoning about temporal events
  • These capabilities emerge without explicit supervision for these tasks during the core self-supervised pretraining stage, indicating that the model captures high-level semantic and temporal structure. V-JEPA 2 by Assran et al. (2025) reports strong motion understanding, state-of-the-art human action anticipation, and video question-answering performance after language alignment.

Action-Free Pretraining

  • A key insight is that meaningful world models can be learned without action labels. Video provides sequences of states:
\[o_1, o_2, \dots, o_T\]
  • From these, the model learns implicit dynamics:
\[z_{t+1} \approx g_\phi(z_t)\]
  • This enables large-scale pretraining using passive data, which is far more abundant than interaction data.

  • In the functional taxonomy, this is the transition from observation sequences to a simulator-like latent model. The model does not yet know which actions caused the observed transitions, but it learns regularities of motion, persistence, occlusion, and temporal change that later action-conditioned models can exploit. A Functional Taxonomy of World Models emphasizes that simulation is the bridge between rendering and planning because it captures how the world changes, not merely how it appears.

Action-Conditioned Post-Training

  • To enable control and planning, V-JEPA 2 introduces a second stage where actions are incorporated:
\[\hat{z}_{t+1} = g_\phi(z_t, a_t)\]
  • This stage uses a relatively small amount of robot interaction data to learn action-conditioned dynamics on top of the pretrained representation. V-JEPA 2 by Assran et al. (2025) reports post-training an action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the DROID dataset.

  • The resulting model, V-JEPA 2-AC, supports planning by simulating trajectories in latent space.

Planning in Latent Space

  • Planning is performed using model predictive control (MPC) in the learned latent space. Given a goal state \(z^*\), the system searches for an action sequence that minimizes a cost function:
\[\min_{a_{t:t+H}} \sum_{k=1}^{H} \left\| \hat{z}_{t+k} - z^* \right\|^2\]
  • The latent dynamics model is used to simulate candidate trajectories efficiently.

  • This approach avoids the computational cost of generating full video frames during planning. Functionally, it converts a simulator-style latent world model into a planner substrate: the model predicts possible futures, and the control loop selects the action sequence whose predicted future best matches the goal.
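  • A minimal sketch of this planning loop is given below, using random shooting as a deliberately simple stand-in for the optimizer; it is not the planner used by V-JEPA 2-AC. The dynamics function `toy_step` is a hypothetical placeholder for a learned predictor \(g_\phi\), and all sizes are assumptions:

```python
import numpy as np

def rollout(z0, actions, step):
    """Roll the latent dynamics forward and return the predicted states."""
    z, traj = z0, []
    for a in actions:
        z = step(z, a)
        traj.append(z)
    return np.stack(traj)

def plan_random_shooting(z0, z_goal, step, horizon=5, num_candidates=256,
                         action_dim=2, seed=0):
    """Sample action sequences, score each by summed squared distance to the goal,
    and return the lowest-cost sequence with its cost."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    costs = [np.sum((rollout(z0, acts, step) - z_goal) ** 2) for acts in candidates]
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

def toy_step(z, a):
    # Hypothetical stand-in for a learned predictor: the action shifts the latent.
    return z + 0.5 * a

z0 = np.zeros(2)
z_goal = np.array([1.0, -1.0])
best_actions, best_cost = plan_random_shooting(z0, z_goal, toy_step)
# Baseline: cost of doing nothing for the whole horizon.
zero_cost = float(np.sum((rollout(z0, np.zeros((5, 2)), toy_step) - z_goal) ** 2))
```

  • In a model predictive control loop, only the first action of `best_actions` would be executed before re-encoding the new observation and planning again.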

Relation to 3D and Spatial World Generation

  • Video JEPA learns from temporal observation sequences, while emerging spatial world models try to construct editable 3D environments. The two approaches address related but distinct problems. Video JEPA emphasizes latent temporal prediction, whereas 3D world generation emphasizes spatial state construction, view consistency, and editability.

  • Marble: A Multimodal World Model describes a system that creates editable 3D worlds from text, images, video, or coarse 3D layouts and can export worlds as Gaussian splats, meshes, or videos, placing it closer to the simulator-renderer boundary. A Functional Taxonomy of World Models situates this kind of work within a broader spatial-intelligence agenda, where world models must ultimately support rendering, simulation, and planning.

Advantages over Generative Video Models

  • Compared to generative video models, JEPA-based video models offer several advantages:

    • Efficiency: no need to generate high-resolution frames
    • Focus: emphasizes predictable dynamics rather than visual detail
    • Scalability: leverages large-scale video data effectively
    • Planning compatibility: directly produces latent states for control
  • Generative models often prioritize visual fidelity, while JEPA prioritizes predictive utility. In the functional taxonomy, this is the distinction between renderer optimization and simulator optimization: renderers must look right, while simulators must support reliable state evolution.

Limitations and Challenges

  • Despite its strengths, video JEPA faces several challenges:

    • Implicit action inference: action-free pretraining does not explicitly model causality
    • Deterministic predictions: uncertainty is not always captured
    • Limited object structure: patch-based representations may miss object-level interactions
  • These limitations motivate extensions that incorporate:

    • Action conditioning during training
    • Object-centric representations
    • Probabilistic modeling of uncertainty
    • Explicit 3D or spatial state structure
  • Video JEPA also inherits a broader challenge from the simulator paradigm: a compact latent state may support prediction while remaining difficult to inspect, edit, or validate. This matters when world models are used in safety-critical robotics, driving, or scientific settings.

Transition to Advanced JEPA World Models

  • The progression from I-JEPA to V-JEPA 2 demonstrates how predictive representation learning scales from static perception to dynamic world understanding and planning.

  • The next section explores advanced JEPA-based world models, including object-centric, causal, sequential, and probabilistic variants that address the limitations of current approaches.

Advanced JEPA World Models

  • While I-JEPA and V-JEPA establish the core paradigm of latent predictive learning, they do not fully address key requirements for robust world modeling: interaction reasoning, uncertainty, temporal abstraction, object permanence, and structured representations. Recent work extends JEPA along multiple axes to address these limitations, resulting in a family of advanced world models.

  • In the renderer-simulator-planner taxonomy, these advanced variants primarily strengthen the simulator and planner roles. They make latent state more sequential, object-centric, causal, probabilistic, or action-conditioned, which moves JEPA closer to the requirements of embodied intelligence. A Functional Taxonomy of World Models emphasizes that world models should be evaluated by function: rendering observations, simulating state, or planning actions.

Sequential JEPA and Temporal Structure

  • Sequential JEPA variants aggregate past latent states and actions into a history summary \(h_t\) and predict the next latent state from that summary and the current action:

\[h_t = \text{Transformer}(z_{1:t}, a_{1:t-1})\] \[\hat{z}_{t+1} = g_\phi(h_t, a_t)\]
  • The model learns two types of representations:

    • Equivariant representations at the level of individual observations
    • Invariant representations at the level of aggregated sequence embeddings
  • This architectural separation resolves the trade-off between capturing fine-grained transformations and supporting high-level tasks such as classification.

  • DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture by He et al. (2026) further introduces ordered prediction, where targets are predicted sequentially based on importance:

\[\hat{s}_{y_1} \rightarrow \hat{s}_{y_2} \rightarrow \dots \rightarrow \hat{s}_{y_k}\]
  • This imposes a curriculum over prediction tasks, improving representation quality and interpretability.

  • From the functional-taxonomy perspective, sequential JEPA strengthens the simulator role because it learns how latent state evolves across ordered observations, not merely how isolated views relate.

Object-Centric and Causal JEPA

  • Object-centric variants represent the latent state at time \(t\) as a set of \(N\) object slots:

\[z_t = \{z_t^{(1)}, z_t^{(2)}, \dots, z_t^{(N)}\}\]
  • During training, subsets of object representations are masked, and the model must infer them from other objects:
\[\hat{z}_t^{(i)} = g_\phi(\{z_t^{(j)} : j \neq i\})\]
  • This induces a causal inductive bias, as the model must reason about interactions rather than relying on local correlations.

  • The following figure (source) shows the C-JEPA training pipeline, where object-level masking forces inference of masked object states from surrounding context. Specifically, it shows how object-centric masking induces interaction reasoning and causal structure in latent space. A frozen encoder extracts object-centric representations, followed by selective masking across history. The predictor recovers masked history slots and predicts future latent states, conditioned on optional auxiliary variables, via a joint masked-history and forward-prediction objective.

  • This approach significantly improves performance on tasks requiring counterfactual reasoning and planning.

  • Object-centric JEPA is especially aligned with simulator-style world modeling. A useful simulator should not only predict latent feature vectors; it should expose stable entities, relations, and interventions. Object-level latent masking pushes the representation toward this form by making interaction structure necessary for prediction.

Spatial and 3D World Models

  • The functional taxonomy highlights that world modeling is not only temporal but also spatial. A model may need to represent a scene as a navigable, editable, physically meaningful 3D structure rather than as a sequence of 2D frames. This motivates spatial world models that combine aspects of renderers and simulators.

  • Marble: A Multimodal World Model describes a system that generates editable 3D worlds from text, images, video, or layout inputs and exports them as Gaussian splats, meshes, or videos. This kind of model is renderer-like when it produces views, but simulator-like when it maintains editable scene structure.

  • For JEPA, spatial world modeling suggests an important direction: predict structured latent scene state rather than patch embeddings alone. A future 3D-JEPA-style system could predict object pose, geometry, affordances, and interaction-relevant latent fields without reconstructing every pixel.

Multimodal and Motion-Aware JEPA

  • Motion-aware variants augment the latent prediction objective with an auxiliary motion term \(\mathcal{L}_{\text{flow}}\), weighted by \(\lambda\):

\[\mathcal{L} = \mathcal{L}_{\text{JEPA}} + \lambda \mathcal{L}_{\text{flow}}\]
  • This enables representations that capture both appearance and dynamics.

  • A-JEPA: Joint-Embedding Predictive Architecture Can Listen by Fei et al. (2023) extends JEPA to audio, using time-frequency masking strategies to capture temporal structure in spectrograms.

  • These models demonstrate that JEPA is a general predictive learning framework across modalities.

  • From a functional perspective, multimodal JEPA expands the observation interface of a world model. A robust agent should be able to map audio, vision, proprioception, language, and action histories into a shared predictive state.

Probabilistic JEPA and Uncertainty Modeling

  • Probabilistic variants replace the single point prediction with a learned distribution over next latent states:

\[q_\phi(z_{t+1} \mid z_t) \approx p(z_{t+1} \mid z_t)\]
  • with a variational objective:
\[\mathcal{L} = \mathbb{E}_{q} \left[ \left\| \hat{z}_{t+1} - z_{t+1} \right\|^2 \right] + D_{\text{KL}}(q(z_{t+1}) \parallel p(z_{t+1}))\]
  • This enables:

    • Uncertainty estimation via sampling
    • Robust prediction in stochastic environments
    • Planning under uncertainty
  • The framework connects JEPA to predictive state representations and Bayesian filtering.
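  • The two ingredients of such an objective, sampling candidate futures and a KL term, can be illustrated under a diagonal-Gaussian assumption. The Gaussian parameterization, the function names, and the numeric values below are assumptions made for this sketch; the text above does not prescribe a particular distribution family:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    ))

def sample_next_latents(mu, sigma, num_samples, rng):
    """Draw candidate futures z_{t+1} = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=(num_samples, mu.shape[0]))
    return mu + sigma * eps

rng = np.random.default_rng(0)
# Hypothetical predictor output (mean and standard deviation per latent dimension).
mu_q, sigma_q = np.array([0.5, -0.2]), np.array([0.3, 0.3])
# Hypothetical reference distribution for the KL term.
mu_p, sigma_p = np.zeros(2), np.ones(2)

samples = sample_next_latents(mu_q, sigma_q, num_samples=1000, rng=rng)
spread = samples.std(axis=0)  # empirical per-dimension uncertainty estimate
kl_same = kl_diag_gaussians(mu_q, sigma_q, mu_q, sigma_q)
kl_prior = kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
```

  • A planner can score each sampled future separately and aggregate the costs, for example by their mean or worst case, instead of trusting a single predicted trajectory.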

  • Probabilistic JEPA is important for simulator and planner world models because a single predicted future is often insufficient. A planner must reason over multiple possible futures when observations are partial, dynamics are stochastic, or other agents behave unpredictably.

End-to-End JEPA World Models

  • Many JEPA systems rely on pre-trained encoders or auxiliary mechanisms to prevent collapse. End-to-end approaches aim to simplify training.

  • LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels by Maes et al. (2026) proposes a minimal formulation:

    \[\mathcal{L} = \left\| \hat{z}_{t+1} - z_{t+1} \right\|^2 + \lambda \mathcal{L}_{\text{reg}}\]
    • where the regularizer enforces Gaussian-distributed latent embeddings.
  • The following figure (source) shows the LeWorldModel training pipeline, illustrating joint optimization of encoder and predictor with a simple loss. LeWorldModel is a JEPA-based latent dynamics pipeline where an encoder produces latent states and a predictor models transitions across time. Specifically, it shows how latent dynamics are learned directly from pixels without pretraining or auxiliary objectives. Given frame observations \(\boldsymbol{o}_{1: T}\) and actions \(\boldsymbol{a}_{1: T}\), the encoder maps frames into low-dimensional latent representations \(\boldsymbol{z}_{1: T}\). The predictor models the environment dynamics by autoregressively predicting the next latent state \(\boldsymbol{z}_{t+1}\) from the current latent state \(\boldsymbol{z}_t\) and action \(\boldsymbol{a}_t\). The encoder and predictor are jointly optimized using a mean-squared error (MSE) prediction loss. LeWM does not rely on any training heuristics, such as stop-gradient, exponential moving averages, or pre-trained representations. To prevent trivial collapse, the SIGReg regularization term enforces Gaussian-distributed latent embeddings, promoting feature diversity. For tractability, latent embeddings are projected onto multiple random directions, and a normality test is applied to each one-dimensional projection. Aggregating these statistics encourages the full embedding distribution to match an isotropic Gaussian.

  • This approach reduces complexity while maintaining performance and stability.
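  • The idea of testing Gaussianity on random one-dimensional projections can be sketched as follows. This is not SIGReg itself: the normality test is replaced here by a simple moment-matching score (mean, variance, skewness, excess kurtosis), and the function name, number of directions, and batch sizes are assumptions:

```python
import numpy as np

def projected_gaussianity_penalty(z, num_directions=64, seed=0):
    """Moment-matching stand-in for a sliced normality test.

    Projects embeddings z of shape (batch, dim) onto random unit directions and
    penalizes deviations of each 1-D projection from N(0, 1) in its first four moments.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(z.shape[1], num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions                      # (batch, num_directions)
    mean = proj.mean(axis=0)
    var = proj.var(axis=0)
    centered = proj - mean
    std = np.sqrt(var + 1e-8)
    skew = (centered ** 3).mean(axis=0) / std ** 3
    excess_kurt = (centered ** 4).mean(axis=0) / std ** 4 - 3.0
    per_direction = mean ** 2 + (var - 1.0) ** 2 + skew ** 2 + excess_kurt ** 2
    return float(per_direction.mean())

rng = np.random.default_rng(1)
gaussian_z = rng.normal(size=(4096, 16))       # close to an isotropic Gaussian
collapsed_z = np.full((4096, 16), 0.3)         # every input mapped to one point
penalty_gaussian = projected_gaussianity_penalty(gaussian_z)
penalty_collapsed = projected_gaussianity_penalty(collapsed_z)
```

  • A collapsed encoder produces constant projections, so its penalty is large, while embeddings that already look isotropic Gaussian score close to zero.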

Unified World Models

  • A fully general world model would combine the three functional roles:

    • Renderer: generate observations from state.
    • Simulator: roll forward latent or explicit world state.
    • Planner: choose actions that achieve goals.
  • The following figure (source) shows the convergence toward unified world models that combine rendering, simulation, and planning. Specifically, it shows a unified world-model architecture in which rendering produces interpretable observations, simulation maintains and evolves world state, and planning selects actions by evaluating predicted futures.

  • Advanced JEPA systems mostly strengthen the simulator and planner components, but they could be combined with renderers when human-interpretable visual output is needed. For example, a JEPA latent simulator could provide compact predictive dynamics, while a renderer decodes selected latent states into observations for inspection or communication.

Unifying Perspective

  • These extensions collectively transform JEPA into a comprehensive world modeling framework:

    • Sequential JEPA captures temporal dependencies
    • Object-centric JEPA models interactions and causality
    • Spatial world models expose editable 3D scene structure
    • Multimodal JEPA integrates diverse sensory inputs
    • Probabilistic JEPA represents uncertainty
    • End-to-end JEPA simplifies training and improves scalability
  • Together, they address the core challenges of world modeling: representation, dynamics, interaction, uncertainty, and action selection.

Transition to Implementation

  • While the conceptual framework of JEPA is well-defined, practical deployment requires careful design choices in architecture, masking, optimization, and scaling.

  • The next section provides detailed implementation guidance, including architectural configurations, training procedures, and engineering considerations for building JEPA-based world models.

Implementation Details for JEPA-Based World Models

  • Building a JEPA-based world model requires choosing the representation format, prediction target, masking policy, predictor architecture, collapse-prevention strategy, and planning interface. The implementation should be designed around the intended domain: images, video, audio, robotics, object-centric environments, or spatial 3D worlds.

  • The renderer-simulator-planner taxonomy is useful at implementation time because it forces the central design question: what should the system output? A renderer needs an observation decoder, a simulator needs a reliable latent or explicit state transition model, and a planner needs an action-selection mechanism. A Functional Taxonomy of World Models frames this distinction by separating world models according to whether they produce observations, states, or actions.

Data Representation

  • For image and video models, observations are usually converted into patch or tubelet tokens:
\[o \rightarrow \{p_1, p_2, \dots, p_N\}\]
  • For video, tubelets preserve local spatiotemporal structure:
\[o_{1:T} \rightarrow \{p_{i,j,t}\}\]
  • For object-centric models, a frozen or trainable object encoder maps observations into object slots:
\[z_t = \{z_t^{(1)}, z_t^{(2)}, \dots, z_t^{(N)}\}\]
  • This object-level representation is useful when the task depends on interaction, counterfactual reasoning, or physical causality, as in Causal-JEPA: Learning World Models through Object-Level Latent Interventions by Nam et al. (2026).

  • For spatial world models, the representation may instead be a 3D scene state, such as a mesh, point cloud, radiance field, Gaussian splat field, voxel grid, scene graph, object layout, or hybrid latent field. Marble: A Multimodal World Model illustrates this design space by producing editable 3D worlds from text, image, video, or coarse 3D layout inputs and exporting them in multiple visual or geometric formats.
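  • The patch and tubelet tokenizations described earlier in this section can be written with plain array reshapes. The sketch below assumes channel-last arrays and sizes divisible by the patch and tubelet dimensions; the function names are illustrative:

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)

def tubeletify(video, patch, tube):
    """Split a (T, H, W, C) clip into (N, tube*patch*patch*C) flattened tubelet tokens."""
    t, h, w, c = video.shape
    assert t % tube == 0 and h % patch == 0 and w % patch == 0
    x = video.reshape(t // tube, tube, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)       # (t_blocks, rows, cols, tube, patch, patch, C)
    return x.reshape(-1, tube * patch * patch * c)

image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
video = np.zeros((4, 8, 8, 3), dtype=np.float32)
patch_tokens = patchify(image, patch=4)              # 2 x 2 grid of patches
tubelet_tokens = tubeletify(video, patch=4, tube=2)  # 2 x 2 x 2 grid of tubelets
```

  • Each tubelet spans `tube` consecutive frames of the same spatial patch, which is what lets a single token carry local motion information.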

Renderer, Simulator, and Planner Interfaces

  • A JEPA-based system should expose different interfaces depending on its intended role.

  • A renderer interface maps latent state to observations:

\[\hat{o}_t = d_\psi(z_t)\]
  • This decoder is optional in JEPA and is often omitted when the objective is representation learning or planning efficiency.

  • A simulator interface maps current latent state and action to future latent state:

\[\hat{z}_{t+1}=g_\phi(z_t,a_t)\]
  • This is the natural interface for JEPA world models, because the model is trained to predict representations rather than pixels.

  • A planner interface maps current state and goal to actions:

\[a_t^*=\pi_\omega(z_t,z_g)\]
  • or searches over candidate action sequences:

    \[a_{t:t+H}^* =\arg\min_{a_{t:t+H}} \sum_{k=1}^{H} d(\hat{z}_{t+k},z_g)\]
  • This separation makes the implementation modular: a latent JEPA simulator can be paired with a renderer for visualization, a planner for control, or both.
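  • One way to express this modularity in code is with structural interfaces. The sketch below uses Python's `typing.Protocol`; the class and method names are assumptions for illustration, and the toy simulator and greedy one-step planner stand in for learned components:

```python
from typing import Protocol, Sequence
import numpy as np

class Renderer(Protocol):
    def render(self, z: np.ndarray) -> np.ndarray: ...

class Simulator(Protocol):
    def step(self, z: np.ndarray, action: np.ndarray) -> np.ndarray: ...

class Planner(Protocol):
    def act(self, z: np.ndarray, z_goal: np.ndarray) -> np.ndarray: ...

class ToySimulator:
    """Hypothetical latent dynamics: the action is added to the state."""
    def step(self, z, action):
        return z + action

class GreedyPlanner:
    """Picks the candidate action whose one-step prediction is closest to the goal."""
    def __init__(self, simulator: Simulator, candidates: Sequence[np.ndarray]):
        self.simulator = simulator
        self.candidates = candidates

    def act(self, z, z_goal):
        costs = [np.sum((self.simulator.step(z, a) - z_goal) ** 2) for a in self.candidates]
        return self.candidates[int(np.argmin(costs))]

candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
planner = GreedyPlanner(ToySimulator(), candidates)
chosen = planner.act(np.zeros(2), np.array([0.0, 2.0]))
```

  • Because the planner depends only on the `Simulator` interface, the toy dynamics could be swapped for a learned JEPA predictor, and a `Renderer` could be attached separately for visualization, without changing the planning code.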

Encoder Design

  • Most JEPA implementations use Transformer encoders. A standard image configuration follows the Vision Transformer pipeline:

    \[x_i = E p_i + e_i\]
    • where \(E\) is a patch embedding matrix and \(e_i\) is a positional embedding.
  • For video, positional information must encode both space and time:

\[x_{i,j,t} = E p_{i,j,t} + e_i^{\text{row}} + e_j^{\text{col}} + e_t^{\text{time}}\]
  • Implementation choices typically include:

    • Patch size: smaller patches improve detail but increase compute.
    • Embedding dimension: larger dimensions improve capacity but increase memory.
    • Depth: deeper encoders improve abstraction.
    • Attention windowing: local attention can reduce video compute.
  • I-JEPA by Assran et al. (2023) uses Vision Transformers and shows that scaling the encoder improves representation quality while avoiding pixel reconstruction.
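  • The factorized video embedding above maps directly onto array broadcasting. In this sketch the embedding tables are random placeholders for learned parameters, the sizes are arbitrary assumptions, and tokens are stored as row vectors, so the product is written `tokens @ E` rather than \(E p\):

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, frames = 2, 2, 3        # token grid for a hypothetical short clip
token_dim, embed_dim = 96, 16       # flattened tubelet size and model width

E = rng.normal(scale=0.02, size=(token_dim, embed_dim))   # patch embedding matrix
e_row = rng.normal(scale=0.02, size=(rows, embed_dim))    # row positional table
e_col = rng.normal(scale=0.02, size=(cols, embed_dim))    # column positional table
e_time = rng.normal(scale=0.02, size=(frames, embed_dim)) # temporal positional table

tokens = rng.normal(size=(rows, cols, frames, token_dim)) # p_{i,j,t}

# x_{i,j,t} = E p_{i,j,t} + e_i^row + e_j^col + e_t^time, computed via broadcasting.
x = (tokens @ E
     + e_row[:, None, None, :]
     + e_col[None, :, None, :]
     + e_time[None, None, :, :])

# Flatten the grid into a token sequence for a Transformer encoder.
sequence = x.reshape(-1, embed_dim)
```

  • Factorizing the positional embedding keeps the number of positional parameters proportional to rows + cols + frames rather than their product.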

  • For spatial models, encoder design may require fusing 2D and 3D information. A practical architecture may use an image encoder for appearance, a depth or geometry encoder for spatial structure, and a cross-attention module to bind them into a scene-level latent state. This is especially relevant when the system is expected to simulate editable environments rather than only predict future video embeddings.

Target Encoder

  • The target encoder provides embeddings for masked or future targets. In many JEPA systems, it is an exponential moving average of the context encoder:

    \[\bar{\theta} \leftarrow m \bar{\theta} + (1-m)\theta\]
    • where \(m\) is the EMA momentum.
  • A high momentum value makes target representations stable, reducing oscillation and helping avoid collapse. The target encoder is usually used with stop-gradient:

    \[s_y = \text{sg}(f_{\bar{\theta}}(y))\]
    • where \(\text{sg}(\cdot)\) prevents gradients from flowing into the target branch.
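  • The EMA update can be sketched on plain parameter dictionaries. The dictionary layout, the default momentum value, and the toy parameters are assumptions for illustration:

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.996):
    """In-place update: target <- m * target + (1 - m) * context, per parameter."""
    for name, theta in context_params.items():
        target_params[name] = momentum * target_params[name] + (1.0 - momentum) * theta

# Hypothetical parameter dictionaries standing in for the two encoders.
context_params = {"w": np.ones((2, 2)), "b": np.ones(2)}
target_params = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

# With a fixed context encoder, the target converges geometrically toward it.
for _ in range(100):
    ema_update(target_params, context_params, momentum=0.9)
```

  • NumPy has no automatic differentiation, so the stop-gradient is implicit here; in an autodiff framework the target branch would additionally be evaluated without gradient tracking.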

Predictor Architecture

  • The predictor maps context embeddings to target embeddings:

    \[\hat{s}_y = g_\phi(s_x, q_y)\]
    • where \(q_y\) is a query embedding that encodes the target position, time, object identity, or action.
  • A common implementation is a lightweight Transformer decoder or MLP-Transformer hybrid:

\[h = \text{Transformer}_{\phi}([s_x; q_y])\] \[\hat{s}_y = W h_y\]
  • For action-conditioned world models, the predictor conditions on actions:

    \[\hat{z}_{t+1} = g_\phi(z_t, a_t)\]
    • or over a horizon:

      \[\hat{z}_{t+k+1} = g_\phi(\hat{z}_{t+k}, a_{t+k})\]
  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning by Assran et al. (2025) uses action-conditioned post-training to adapt video representations for robot planning.

  • For planner-oriented systems, the predictor must be stable under rollout. One-step prediction quality is not sufficient; errors compound across imagined trajectories. In practice, this motivates scheduled rollout training, latent consistency losses, short-horizon regularization, or model predictive control with frequent replanning.
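  • The compounding effect can be seen in a toy example with linear latent dynamics, where the model differs from the true dynamics by a small, systematic one-step error. The matrices and horizon are arbitrary assumptions chosen to make the effect visible:

```python
import numpy as np

def rollout(z0, actions, A, B):
    """Apply z_{k+1} = A z_k + B a_k repeatedly and return all predicted states."""
    z, states = z0, []
    for a in actions:
        z = A @ z + B @ a
        states.append(z)
    return np.stack(states)

dim, horizon = 4, 20
A_true = 0.95 * np.eye(dim)
A_model = A_true + 0.02 * np.eye(dim)   # small one-step model error
B = np.eye(dim)

z0 = np.ones(dim)
actions = np.zeros((horizon, dim))       # zero actions isolate the dynamics error
true_states = rollout(z0, actions, A_true, B)
model_states = rollout(z0, actions, A_model, B)

# Distance between imagined and true latent state at each rollout step.
errors = np.linalg.norm(model_states - true_states, axis=1)
```

  • In this example the error after 20 imagined steps is several times larger than the one-step error, which is why frequent replanning from fresh observations is a common safeguard.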

Masking Policies

  • Masking determines what the model must predict. Poor masking can make the task too easy or too hard.

  • For images, useful masking policies include:

    • Large target blocks: encourage semantic prediction.
    • Distributed context blocks: preserve enough global information.
    • Multiple targets per image: improve coverage.
  • For video, masking should cover spatiotemporal regions:

\[M \subseteq H \times W \times T\]
  • Good video masks force the model to infer motion rather than interpolate nearby frames.

  • For audio, masking must respect time-frequency correlations. A-JEPA: Joint-Embedding Predictive Architecture Can Listen by Fei et al. (2023) uses curriculum masking that gradually shifts from random block masking to time-frequency-aware masking.

  • For object-centric models, object-level masking can enforce relational reasoning:

\[\hat{z}^{(i)} = g_\phi(z^{(-i)})\]
  • This prevents the model from relying only on local patch continuity.

  • For spatial world models, masking can operate over camera views, 3D regions, object slots, depth layers, or scene graph nodes. A simulator-oriented mask should remove enough structure to require spatial reasoning, while still preserving enough context to make the target predictable.

Loss Functions

  • The base JEPA loss is a latent regression loss:
\[\mathcal{L}_{\text{pred}} = \frac{1}{|\mathcal{T}|} \sum_{y \in \mathcal{T}} \left\| g_\phi(f_\theta(x), q_y) - \text{sg}(f_{\bar{\theta}}(y)) \right\|_2^2\]
  • For multi-task JEPA systems, additional terms may be added:
\[\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda_{\text{aux}}\mathcal{L}_{\text{aux}} + \lambda_{\text{reg}}\mathcal{L}_{\text{reg}}\]
  • When a renderer interface is required, an observation reconstruction term can also be added:
\[\mathcal{L}_{\text{render}}=\left\|\hat{o}_t-o_t\right\|^2\]
  • This term should be used carefully. If it dominates the objective, the system can drift back toward modeling high-entropy visual details rather than compact predictive state.

Collapse Prevention

  • Collapse occurs when the encoder maps all inputs to the same representation:
\[f_\theta(o)=c\]
  • This can minimize latent prediction loss while producing useless representations.

  • Common anti-collapse mechanisms include:

    • EMA target encoders
    • Stop-gradient target branches
    • Predictor bottlenecks
    • Variance or covariance regularization
    • Distributional regularizers
    • Careful masking
  • A simple diagnostic is the per-dimension feature variance:

\[\text{Var}(z_j) = \frac{1}{B}\sum_{i=1}^{B}(z_{ij}-\mu_j)^2\]
  • If many dimensions have near-zero variance, the representation is collapsing.
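  • The variance diagnostic above can be sketched as follows. The threshold `eps` and the helper name `collapsed_fraction` are illustrative assumptions, and the threshold should be tuned to the scale of the embeddings:

```python
import numpy as np

def collapsed_fraction(z, eps=1e-4):
    """Fraction of embedding dimensions whose batch variance is below eps.

    z: array of shape (batch, dim).
    """
    var = z.var(axis=0)            # population variance, matching the 1/B formula
    return float(np.mean(var < eps))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 32))                       # varied embeddings
collapsed = np.tile(rng.normal(size=(1, 32)), (256, 1))    # same vector for every input
frac_healthy = collapsed_fraction(healthy)
frac_collapsed = collapsed_fraction(collapsed)
print(frac_healthy, frac_collapsed)
```

  • Logging this fraction during training gives an early warning before downstream probes start to degrade.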

Training Loop

  • A typical JEPA training step is:

    • Sample observations or clips.
    • Generate context and target masks.
    • Encode the context with the context encoder.
    • Encode the full or target observation with the target encoder.
    • Predict target embeddings from context embeddings.
    • Compute latent prediction loss.
    • Backpropagate through context encoder and predictor.
    • Update target encoder using EMA.
  • In pseudocode:

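import torch

# Assumes context_encoder, target_encoder, predictor, optimizer, mse,
# apply_context_mask, gather_targets, and ema_update are defined elsewhere.
context = apply_context_mask(batch, context_mask)

z_context = context_encoder(context)
with torch.no_grad():  # stop-gradient: the target branch receives no gradients
    z_target = target_encoder(batch)
    z_target = gather_targets(z_target, target_masks)

z_pred = predictor(z_context, target_queries)
loss = mse(z_pred, z_target)

optimizer.zero_grad()  # clear gradients left over from the previous step
loss.backward()
optimizer.step()  # updates the context encoder and predictor only
ema_update(target_encoder, context_encoder)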
  • For action-conditioned training, the batch also includes actions:
z_t = encoder(obs_t)
with torch.no_grad():
    z_next = target_encoder(obs_next)

z_pred_next = predictor(z_t, action_t)
loss = mse(z_pred_next, z_next)
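
  • The `ema_update` helper in the pseudocode above is left undefined. One plausible form, sketched here with parameters stored as dicts of NumPy arrays to stay framework-free, is shown below. The momentum value is an illustrative default rather than a setting taken from the text:

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.996):
    """In-place EMA: target <- m * target + (1 - m) * online.

    Both arguments are dicts mapping parameter names to NumPy arrays
    (an assumption made to keep the sketch independent of any framework).
    """
    for name, online in online_params.items():
        target_params[name] *= momentum
        target_params[name] += (1.0 - momentum) * online

online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
ema_update(target, online, momentum=0.9)
print(target["w"])   # each entry has moved 10% of the way toward the online value
```

  • A momentum close to 1 makes the target encoder change slowly, which keeps the regression targets stable between optimizer steps.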

Planning Interface

  • A JEPA becomes useful for control when the learned latent dynamics can evaluate candidate actions. Given a goal embedding \(z_g\), planning can minimize:

    \[J(a_{t:t+H}) = \sum_{k=1}^{H} d(\hat{z}_{t+k}, z_g)\]
    • where the rollout starts from the encoded current state, \(\hat{z}_t = z_t\), and:

      \[\hat{z}_{t+k+1}=g_\phi(\hat{z}_{t+k},a_{t+k})\]
  • The distance \(d\) may be mean squared error, cosine distance, or a learned energy function.

  • This supports model predictive control:

    • Sample candidate action sequences.
    • Roll them out in latent space.
    • Score each rollout against the goal.
    • Execute the first action.
    • Replan at the next step.
  • In functional terms, this is where the simulator becomes useful to the planner. The JEPA predictor supplies imagined latent futures, and the planner selects actions that make the imagined future match the goal.
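
  • The model predictive control loop above can be sketched with random shooting, one simple choice of optimizer. Everything here is a toy assumption: `toy_dynamics` stands in for a learned predictor \(g_\phi\), the distance \(d\) is squared error, and actions are sampled uniformly from \([-1, 1]\):

```python
import numpy as np

def plan_first_action(z0, z_goal, dynamics, horizon, num_samples, action_dim, rng):
    """Random-shooting MPC: return the first action of the lowest-cost rollout."""
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    z = np.repeat(z0[None, :], num_samples, axis=0)
    cost = np.zeros(num_samples)
    for k in range(horizon):
        z = dynamics(z, actions[:, k])                # one latent rollout step
        cost += np.sum((z - z_goal) ** 2, axis=-1)    # d = squared error to the goal
    return actions[np.argmin(cost), 0]

def toy_dynamics(z, a):
    """Hypothetical stand-in for a learned predictor: the action shifts the latent."""
    return z + 0.5 * a

rng = np.random.default_rng(0)
z, z_goal = np.zeros(2), np.array([1.0, -1.0])
initial_dist = np.linalg.norm(z - z_goal)
for _ in range(10):                                   # execute one action, then replan
    a = plan_first_action(z, z_goal, toy_dynamics, horizon=5,
                          num_samples=256, action_dim=2, rng=rng)
    z = toy_dynamics(z, a)
final_dist = np.linalg.norm(z - z_goal)
print(initial_dist, final_dist)
```

  • Only the first action of the best sequence is executed before replanning, which limits how far model errors can compound.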

Evaluation

  • JEPA world models should be evaluated along several axes:

    • Representation quality: linear probing, k-NN, fine-tuning.
    • Prediction quality: latent prediction error over time.
    • Planning quality: success rate, trajectory efficiency.
    • Robustness: sensitivity to distractors, occlusion, distribution shift.
    • Uncertainty calibration: when using probabilistic JEPA variants.
  • For world modeling, downstream control and planning performance are usually more meaningful than pixel reconstruction metrics.

  • The renderer-simulator-planner taxonomy implies that evaluation should match the output role:

    • Renderer evaluation: visual fidelity, temporal coherence, view consistency, prompt controllability.
    • Simulator evaluation: state accuracy, physical consistency, rollout stability, editability, counterfactual validity.
    • Planner evaluation: goal completion, sample efficiency, robustness, safety, and recovery from distribution shift.
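
  • As one concrete instance of "latent prediction error over time", the sketch below measures open-loop rollout error per horizon step. It assumes ground-truth latents come from a frozen encoder and that the model has the signature `dynamics(z, a)`; both the helper name and the toy models are hypothetical:

```python
import numpy as np

def rollout_error(z_true, actions, dynamics):
    """Open-loop squared latent error at each horizon step.

    z_true: (H + 1, dim) encoded ground-truth latents; actions: (H, action_dim).
    Predictions are fed back into the model, so errors can compound.
    """
    z_hat = z_true[0]
    errors = []
    for k in range(len(actions)):
        z_hat = dynamics(z_hat, actions[k])
        errors.append(float(np.sum((z_hat - z_true[k + 1]) ** 2)))
    return errors

# Toy check: the true system adds the action, the model under-shoots by 10%.
actions = np.ones((5, 1))
z_true = np.vstack([np.zeros((1, 1)), np.cumsum(actions, axis=0)])
errors = rollout_error(z_true, actions, lambda z, a: z + 0.9 * a)
print(errors)
```

  • Plotting these errors against the horizon shows how quickly a model that looks accurate for one step degrades over longer rollouts.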

Engineering Considerations

  • Important implementation details include:

    • Normalize target embeddings before computing loss.
    • Use mixed precision for large video models.
    • Cache target masks and positional queries for efficiency.
    • Keep the predictor smaller than the encoder to avoid shortcut learning.
    • Use gradient clipping for long video sequences.
    • Monitor feature variance and pairwise cosine similarity during training.
    • Separate renderer, simulator, and planner modules unless there is a clear reason to train them end-to-end.
    • Use explicit state validation when deploying simulator-style models in safety-critical settings.
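  • For the monitoring item above, a minimal sketch of the pairwise cosine similarity check, assuming a batch of non-zero embeddings of shape (B, dim). The helper name is hypothetical:

```python
import numpy as np

def mean_pairwise_cosine(z):
    """Average cosine similarity over all distinct pairs in a batch (B, dim)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # assumes no zero rows
    sim = z @ z.T
    b = len(z)
    return float((sim.sum() - np.trace(sim)) / (b * (b - 1)))

rng = np.random.default_rng(0)
spread = mean_pairwise_cosine(rng.normal(size=(256, 64)))
collapsed = mean_pairwise_cosine(np.tile(rng.normal(size=(1, 64)), (256, 1)))
print(spread, collapsed)
```

  • Values drifting toward 1 indicate that embeddings are becoming aligned, which complements the per-dimension variance check.
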
  • The next section covers probabilistic and energy-based interpretations of JEPA, including how JEPA connects to energy-based models, latent-variable inference, uncertainty-aware planning, and variational JEPA.

Probabilistic and Energy-Based Interpretations of JEPA

  • While JEPA is typically introduced as a deterministic latent prediction framework, its formulation admits deeper interpretations in terms of energy-based modeling, probabilistic inference, and predictive information. These perspectives are essential for extending JEPA to uncertainty-aware world models and principled planning systems.

  • In the renderer-simulator-planner taxonomy, probabilistic and energy-based views are most relevant to simulator and planner world models. A simulator must represent uncertainty over future states, and a planner must compare possible futures under goals, costs, constraints, and risks. A Functional Taxonomy of World Models separates these roles by output type, but probabilistic inference is the connective tissue that lets simulated futures support action selection.

Energy-Based View of JEPA

  • JEPA can be interpreted as an energy-based model (EBM), where the goal is to assign low energy to compatible pairs of representations and high energy to incompatible ones.

  • Define an energy function:

\[E_\theta(x, y) = \left\| g_\phi(f_\theta(x)) - f_{\bar{\theta}}(y) \right\|_2^2\]
  • Training minimizes this energy for compatible pairs \((x, y)\). In contrast to classical EBMs, JEPA does not explicitly sample negative pairs; instead, the architectural design and masking strategy implicitly define compatibility.

  • This aligns with the general formulation of energy-based learning, where the objective is to shape an energy landscape over possible configurations. Introduction to Latent Variable Energy-Based Models: A Path Towards Autonomous Machine Intelligence by Dawid and LeCun (2023) describes how such models avoid explicit likelihoods while still learning meaningful dependencies.

  • In this view, JEPA defines an implicit compatibility function over latent states, with prediction acting as a mechanism for energy minimization.

Predictive Information Perspective

  • Another interpretation is that JEPA maximizes predictive information between context and target representations.

  • Let \(z_x = f_\theta(x)\) and \(z_y = f_{\bar{\theta}}(y)\). The objective encourages \(z_x\) to retain information that is useful for predicting \(z_y\):

    \[\max I(z_x; z_y)\]
    • subject to compression constraints imposed by the encoder.
  • This connects JEPA to the predictive information bottleneck:

    \[\max I(z_x; z_y) - \beta I(z_x; x)\]
    • where \(\beta\) controls the trade-off between prediction and compression.
  • This formulation explains why JEPA representations tend to discard unpredictable details while preserving structure relevant for forecasting future states. In functional terms, it favors simulator-relevant information over renderer-only detail.
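
  • One standard way to make this connection concrete is a textbook variational bound, offered here as an illustration rather than something derived in the cited papers. Assume a Gaussian predictive distribution \(q(z_y \mid z_x) = \mathcal{N}(g_\phi(z_x), \sigma^2 I)\) with a fixed variance \(\sigma^2\), and read \(H\) as differential entropy. Then:

\[I(z_x; z_y) = H(z_y) - H(z_y \mid z_x) \ge H(z_y) + \mathbb{E}\left[\log q(z_y \mid z_x)\right] = H(z_y) - \frac{1}{2\sigma^2}\,\mathbb{E}\left\| z_y - g_\phi(z_x) \right\|_2^2 + \text{const}\]

  • Under these assumptions, reducing the latent prediction error raises a lower bound on \(I(z_x; z_y)\) only while the target entropy \(H(z_y)\) does not shrink, which is one way to see why collapse prevention is needed alongside the prediction loss.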

Deterministic vs Probabilistic Prediction

  • Standard JEPA models predict a single latent embedding:
\[\hat{z}_{t+1} = g_\phi(z_t)\]
  • This corresponds to a point estimate of the conditional distribution:
\[p(z_{t+1} \mid z_t)\]
  • However, real-world dynamics are often stochastic. Deterministic prediction can lead to averaging effects or loss of multimodal structure.

  • A renderer may express uncertainty visually by sampling multiple videos, but a simulator or planner requires uncertainty over state and action consequences. This is especially important when the agent is partially observing the world, interacting with other agents, or operating in safety-critical settings.
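
  • The averaging effect can be illustrated with a toy example. Assume that from the same \(z_t\) the next latent is either \(-1\) or \(+1\) with equal probability; the setup and variable names below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stochastic dynamics: two equally likely futures from the same state.
z_next = rng.choice([-1.0, 1.0], size=10_000)

# Search over constant predictions for the one with the lowest mean squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((candidates[:, None] - z_next[None, :]) ** 2).mean(axis=1)
best = candidates[np.argmin(mse)]
print(best, z_next.mean())
```

  • The squared-error minimizer sits near the sample mean, between the two modes, so the deterministic prediction matches neither plausible future.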

Variational JEPA

  • Variational JEPA: Probabilistic World Models by Huang (2026) extends JEPA to a probabilistic setting by modeling a distribution over future latent states.

  • The model introduces a latent variable \(\xi\):

    \[z_{t+1} \sim p_\theta(z_{t+1} \mid z_t, \xi)\]
    • with an approximate posterior:

      \[q_\phi(\xi \mid z_t, z_{t+1})\]
  • The training objective becomes a variational loss:

\[\mathcal{L} = \mathbb{E}_{q_\phi(\xi \mid z_t, z_{t+1})} \left[ \left\| \hat{z}_{t+1} - z_{t+1} \right\|_2^2 \right] + D_{\text{KL}}\left(q_\phi(\xi \mid z_t, z_{t+1}) \parallel p(\xi)\right)\]
  • This formulation enables:

    • Modeling multiple plausible futures
    • Capturing uncertainty in predictions
    • Sampling-based planning
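
  • A minimal sketch of a loss with this structure, assuming a diagonal Gaussian posterior over \(\xi\), a standard normal prior, and an optional weight `beta` on the KL term (with `beta=1` matching the objective above). This is a generic VAE-style construction and is not claimed to be the cited paper's implementation:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def variational_loss(z_pred, z_next, mu, log_var, beta=1.0):
    """Squared latent error plus a beta-weighted KL term, averaged over the batch."""
    recon = np.sum((z_pred - z_next) ** 2, axis=-1)
    return float(np.mean(recon + beta * gaussian_kl(mu, log_var)))

z_pred = np.array([[1.0, 0.0]])
z_next = np.array([[0.0, 0.0]])
mu = np.array([[1.0, 0.0]])        # posterior mean of xi
log_var = np.zeros((1, 2))         # unit posterior variance
loss = variational_loss(z_pred, z_next, mu, log_var)
print(loss)
```

  • In a full model, `z_pred` would be produced by decoding a sample of \(\xi\) drawn from the posterior, so that different samples yield different plausible futures.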

Latent State as a Predictive Information State

  • A key theoretical result is that JEPA latent states can serve as sufficient statistics for prediction and control.

  • Let \(z_t\) be the latent state learned by JEPA. Under certain conditions:

\[p(z_{t+1} \mid z_t, a_t)\]
  • is sufficient to describe the dynamics of the environment, without requiring access to raw observations.

  • This connects JEPA to Predictive State Representations (PSRs), where state is defined by what it predicts about future observations rather than by how well it reconstructs past ones. Predictive State Representations: A New Theory for Modeling Dynamical Systems by Singh et al. (2004) formalizes this view, which parallels JEPA’s emphasis on predictive latent state rather than reconstruction.

Bayesian JEPA and Belief Updates

  • Extensions such as Bayesian JEPA introduce explicit belief modeling. The latent state becomes a distribution:
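\[b_t(z_t) = p(z_t \mid o_{\le t}, a_{<t})\]
  • Prediction propagates this belief through the latent dynamics, giving the predicted belief before the next observation arrives:
\[\bar{b}_{t+1}(z_{t+1}) = \int p(z_{t+1} \mid z_t, a_t)\, b_t(z_t)\, dz_t\]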
  • In practice, this can be approximated using sampling or parametric distributions.
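
  • A sampling-based approximation of the belief propagation step can be sketched as follows. The belief is represented by particles, and additive Gaussian noise is an assumed stand-in for the stochastic transition; `toy_dynamics` and the noise level are hypothetical:

```python
import numpy as np

def propagate_belief(particles, action, dynamics, noise_std, rng):
    """Monte Carlo prediction step: push each belief sample through the dynamics.

    particles: (N, dim) samples from the current belief. The Gaussian noise
    approximates sampling from p(z_{t+1} | z_t, a_t).
    """
    mean_next = dynamics(particles, action)
    return mean_next + noise_std * rng.normal(size=particles.shape)

def toy_dynamics(z, a):
    """Hypothetical linear latent dynamics."""
    return 0.9 * z + a

rng = np.random.default_rng(0)
belief = rng.normal(loc=0.0, scale=0.1, size=(5000, 2))    # samples from the belief
for _ in range(3):
    belief = propagate_belief(belief, np.array([1.0, 0.0]), toy_dynamics, 0.05, rng)
belief_mean, belief_std = belief.mean(axis=0), belief.std(axis=0)
print(belief_mean, belief_std)
```

  • The spread of the particles gives a planner a direct estimate of how uncertain the predicted state is, without requiring a closed-form integral.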

  • Bayesian formulations enable:

    • Uncertainty-aware planning
    • Robustness to partial observability
    • Integration of prior knowledge

  • From the functional taxonomy perspective, belief modeling is the formal bridge from simulator to planner. The simulator provides a distribution over possible next states, and the planner chooses actions that perform well under that distribution.

Planning with Energy-Based Objectives

  • The energy-based interpretation of JEPA enables planning as energy minimization.

  • Given a goal representation \(z_g\), define a cost:

    \[J(a_{t:t+H}) = \sum_{k=1}^{H} E_\theta(\hat{z}_{t+k}, z_g)\]
    • where:

      \[\hat{z}_{t+k+1} = g_\phi(\hat{z}_{t+k}, a_{t+k})\]
  • Planning becomes:

\[\min_{a_{t:t+H}} J(a_{t:t+H})\]
  • This formulation unifies prediction and control under a single energy framework.

  • A renderer-first model may generate candidate futures for human inspection, but an energy-based planner requires a score over futures. JEPA provides such a score naturally through latent compatibility.
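
  • The minimization over action sequences can be approximated in many ways. The sketch below uses the cross-entropy method (CEM), a sampling-based optimizer that the text does not name and that is chosen here only for illustration. The toy dynamics, the squared-error energy, and all hyperparameters are assumptions:

```python
import numpy as np

def cem_plan(z0, z_goal, dynamics, energy, horizon, action_dim, rng,
             iters=10, pop=200, elite=20):
    """Cross-entropy method over action sequences, minimizing summed energy."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        actions = mean + std * rng.normal(size=(pop, horizon, action_dim))
        z = np.repeat(z0[None, :], pop, axis=0)
        cost = np.zeros(pop)
        for k in range(horizon):
            z = dynamics(z, actions[:, k])
            cost += energy(z, z_goal)
        best = actions[np.argsort(cost)[:elite]]       # keep the lowest-energy plans
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-6
    return mean

def plan_cost(plan, z0, z_goal, dynamics, energy):
    """Summed energy of a single action sequence rolled out from z0."""
    z, total = z0, 0.0
    for a in plan:
        z = dynamics(z, a)
        total += float(energy(z, z_goal))
    return total

dynamics = lambda z, a: z + a                           # toy latent dynamics
energy = lambda z, g: np.sum((z - g) ** 2, axis=-1)     # toy energy: squared distance

rng = np.random.default_rng(0)
z0, z_goal = np.zeros(2), np.array([2.0, -1.0])
plan = cem_plan(z0, z_goal, dynamics, energy, horizon=3, action_dim=2, rng=rng)
baseline = plan_cost(np.zeros((3, 2)), z0, z_goal, dynamics, energy)
optimized = plan_cost(plan, z0, z_goal, dynamics, energy)
print(baseline, optimized)
```

  • When the predictor and energy are differentiable, gradient-based optimization of the action sequence is an alternative to sampling.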

Renderer, Simulator, and Planner Under Uncertainty

  • Uncertainty appears differently across functional world-model roles:

    • Renderer uncertainty: multiple plausible observations or videos.
    • Simulator uncertainty: multiple plausible latent states or physical evolutions.
    • Planner uncertainty: multiple possible action outcomes and risk-sensitive costs.
  • A unified world model should preserve these distinctions. Visual diversity is not equivalent to calibrated state uncertainty, and state uncertainty is not equivalent to robust action selection.

  • This distinction is important for evaluating future world models. A video model may appear diverse and realistic while failing as a simulator because it does not maintain consistent latent state; a simulator may predict plausible state transitions but fail as a planner if its cost function or action interface is poorly specified.

Collapse and Information Geometry

  • From an information-theoretic perspective, collapse corresponds to a degenerate solution where \(I(z_x; z_y) = 0\) because \(z_x\) contains no information about \(y\).

  • JEPA avoids collapse by:

    • Structuring the prediction task to require non-trivial information
    • Using asymmetric architectures
    • Regularizing latent distributions
  • In probabilistic JEPA, collapse can be analyzed through KL divergence and entropy terms, providing theoretical guarantees under certain assumptions.

Relation to Generative Models

  • Generative models optimize likelihood:
\[\max \log p(o)\]
  • JEPA instead optimizes predictive structure:
\[\min \left\| \hat{z}_{t+1} - z_{t+1} \right\|_2^2\]
  • This leads to different inductive biases:

    • Generative models capture the full data distribution.
    • JEPA captures predictable structure.
  • As a result, JEPA can spend less capacity on unpredictable observation detail, which often makes it more efficient for downstream tasks such as planning and control.

  • In taxonomy terms, generative models are often renderer-first: they optimize the distribution of observations. JEPA is simulator-first: it optimizes latent predictive consistency. A complete world model may eventually combine both, using JEPA-style latent prediction for compact dynamics and renderer modules for visual interpretation or communication.

Unifying View

  • From these perspectives, JEPA can be understood as:

    • An energy-based model over latent representations
    • A predictive information maximization framework
    • A latent dynamical system for world modeling
    • A foundation for probabilistic inference and planning
    • A simulator-oriented model that can become planner-capable through action conditioning and goal optimization
  • These interpretations provide the theoretical grounding for JEPA and motivate its extensions to more complex and realistic settings.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledWorldModelsJEPA,
  title   = {World Models: Rendering, Simulation, Planning, and JEPA},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}