Flow Equivariant World Modeling for
Partially Observed Dynamic Environments

Anonymous Submission

The Flow Equivariant World Model (FloWM) predicts 3D dynamics in partially observable environments by maintaining a latent memory map that is equivariant to both external motion (flows) and self-motion (agent movement).

Abstract

Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the motion of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce 'Flow Equivariant World Models', a framework in which both self-motion and external object motion are unified as one-parameter Lie group 'flows'. We leverage this unification to implement group equivariance with respect to these transformations, thereby sharing model weights over locations and motions, eliminating redundant re-learning, and providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed world modeling benchmarks, we demonstrate Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world-modeling architectures, training faster and reaching lower error -- particularly when there are predictable world dynamics outside the agent's current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data-efficient, symmetry-guided, embodied intelligence.

2D Scene demonstrating the problem of partially observable world modeling.

Model Framework (2D and 3D)

a) FloWM recurrence relation in 2D. Velocity channels are plotted as rows. At each timestep, the internal flow and the action flow compose and act on the latent memory map, which is then used to predict the observation at the next timestep. b) Rollout error over time for FloWM, its ablations, and the baselines. c) Training loss across batches, showing that full equivariance lets the model learn significantly faster.
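
The 2D recurrence in panel (a) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes integer velocities on a periodic (wrap-around) grid, an agent-centered hidden state of shape (V, H, W) with one map per velocity channel, and an action given as an integer agent displacement. The name `flow_step`, the velocity list, and the sign convention for the action are hypothetical, and the learned encoder, update, and decoder are omitted.

```python
import numpy as np

def flow_step(hidden, velocities, action):
    """Advance a latent memory map by one timestep (illustrative sketch).

    hidden:     array (V, H, W), one spatial map per velocity channel.
    velocities: list of V integer (dy, dx) pairs, one per channel (assumed).
    action:     integer (dy, dx) displacement of the agent (assumed).
    """
    out = np.empty_like(hidden)
    for k, (vy, vx) in enumerate(velocities):
        # Internal flow: content in channel k moves by that channel's velocity.
        # Action flow: agent motion shifts the agent-centered map the other way.
        shift = (vy - action[0], vx - action[1])
        out[k] = np.roll(hidden[k], shift=shift, axis=(0, 1))
    return out

velocities = [(0, 0), (0, 1), (1, 0)]
hidden = np.zeros((3, 5, 5))
hidden[1, 2, 2] = 1.0  # one feature in the channel for velocity (0, 1)

# Agent stays still: the feature drifts one cell along the x axis.
moved = flow_step(hidden, velocities, action=(0, 0))
# Agent moves with the feature: it stays put in the agent's frame.
tracked = flow_step(hidden, velocities, action=(0, 1))
```

Because the step only shifts each channel, shifting the input and then stepping gives the same result as stepping and then shifting, which is the equivariance property this sketch is meant to show.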

FloWM Recurrence relation in 3D. a) Information passes from the image observations to the hidden state through a ViT encoder. b) The new updates are combined with the existing hidden state, and the action and internal flows roll the hidden state to the next timestep. c) The updated hidden state is used to predict the next timestep observation.

Rollout Results

We evaluate on 2D (MNIST World) and 3D (Dynamic Block World) datasets. We compare FloWM rollouts and quantitative results against a Diffusion Forcing Transformer baseline (DFoT) and a long-context SSM-based baseline (DFoT SSM). We additionally include ablations without velocity channels (VC) and without self-motion equivariance (SME). The first 50 frames are fed into the model as ground-truth context, and the model must then predict the next 150 frames. The ground-truth observations are shown for reference.
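
The context-then-predict protocol described above can be sketched as follows. This is a hedged illustration only: the `step_fn(state, obs)` interface, the `rollout` helper, and the toy model are hypothetical stand-ins, and agent actions are left out for brevity.

```python
import numpy as np

def rollout(step_fn, init_state, frames, n_context=50, n_predict=150):
    """Feed ground-truth context frames, then predict without observations.

    step_fn(state, obs) -> (new_state, predicted_next_frame); obs is None
    once the context is exhausted. This interface is an assumption.
    """
    state, pred = init_state, None
    for t in range(n_context):
        state, pred = step_fn(state, frames[t])  # observe ground truth
    preds = [pred]  # prediction for the first frame after the context
    while len(preds) < n_predict:
        state, pred = step_fn(state, None)  # no observation available
        preds.append(pred)
    return np.stack(preds)

def toy_step(state, obs):
    """Toy model: the scene shifts one pixel along x per step (assumed)."""
    current = obs if obs is not None else state
    nxt = np.roll(current, 1, axis=-1)
    return nxt, nxt

base = np.zeros((4, 8))
base[1, 0] = 1.0
frames = np.stack([np.roll(base, t, axis=-1) for t in range(200)])
preds = rollout(toy_step, None, frames)  # predictions for frames 50..199
```

During the prediction phase the toy model relies only on its own state, which is the regime in which the rollouts on this page are compared.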

Dynamic Partially Observable 2D MNIST World
Static Partially Observable 2D MNIST World

* Since the world has no velocity, the velocity channels are redundant and only add noise in this case.

Dynamic Fully Observable 2D MNIST World

* For fully observable cases, the World View (GT) is the same as the Agent View (GT).

3D Dynamic Block World Rollout #2
3D Dynamic Block World Rollout #3

We visualize failure cases of our model alongside DFoT and DFoT SSM by showing lower-PSNR rollouts:

3D Dynamic Block World Rollout (Medium PSNR)
3D Dynamic Block World Rollout (Low PSNR)

Results Table

Validation Metrics on 2D Dynamic Partially Observable MNIST World

Columns show mean metrics (MSE, PSNR, SSIM) over the first 20 generated frames (matching the training distribution) vs. 150 generated frames (length generalization). 50 frames are passed in as context.
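
As a rough illustration of how such horizon-averaged metrics could be computed, the sketch below evaluates per-frame MSE and PSNR and averages them over the first 20 and the first 150 generated frames. It assumes pixel values in [0, 1] and per-frame PSNR averaging; the exact evaluation code behind the table is not shown on this page, `horizon_metrics` is a hypothetical helper, and SSIM is omitted.

```python
import numpy as np

def horizon_metrics(pred, target, horizons=(20, 150), max_val=1.0):
    """Mean per-frame MSE and PSNR over the first k generated frames.

    pred, target: arrays (T, ...) of generated and ground-truth frames,
    with pixel values assumed to lie in [0, max_val].
    """
    # Per-frame MSE: average over every axis except time.
    mse = ((pred - target) ** 2).reshape(len(pred), -1).mean(axis=1)
    # PSNR in dB; a small floor avoids division by zero for perfect frames.
    psnr = 10.0 * np.log10(max_val ** 2 / np.maximum(mse, 1e-12))
    return {k: {"mse": float(mse[:k].mean()), "psnr": float(psnr[:k].mean())}
            for k in horizons}

# Synthetic example: error is small for 20 frames, then grows.
target = np.zeros((150, 4, 4))
pred = np.full((150, 4, 4), 0.5)
pred[:20] = 0.1
metrics = horizon_metrics(pred, target)
```

In this synthetic example the 150-frame averages are worse than the 20-frame ones, which is the kind of gap the two column groups are meant to expose.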

Validation Metrics on 3D Dynamic Block World

Columns show mean metrics (MSE, PSNR, SSIM) over the first 20 generated frames (matching the training distribution) vs. 150 generated frames (length generalization). 50 frames are passed in as context.

