ThinkJEPA Project page for arXiv:2603.22281 · © 2024-2026 Northeastern University
Latent world models + vision-language reasoning

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Haichao Zhang1, Yijiang Li2, Shwai He3, Tushar Nagarajan4, Mingfei Chen5, Jianglin Lu1, Ang Li3, and Yun Fu1
1 Northeastern University     2 University of California San Diego     3 University of Maryland     4 The University of Texas at Austin     5 University of Washington
For questions about this public release, or if you encounter any issues reproducing the released setup, please contact [email protected].

ThinkJEPA pairs a cortex-like VLM reasoner for high-level semantics and long-horizon intent with a cerebellum-like JEPA controller for low-level dynamics, physical consistency, and fast local correction. Together they form a semantically grounded latent world model for downstream embodied prediction and control.

  • Dual-Temporal: Dense JEPA observation plus uniformly sampled VLM reasoning.
  • Pyramid Guidance: Multi-layer VLM signals guide prediction instead of only the final layer.
  • Code Just Released: Let us know if you run into any problems.
Story

A Cortex and a Cerebellum

The project story is simple and memorable: ThinkJEPA has two complementary brains. One brain thinks. The other brain controls. The paper's dual-temporal architecture maps naturally onto that idea without reducing the method to a metaphor.

Cortex-like reasoner

The VLM branch thinks at a higher semantic level.

The VLM serves as a cortex-like reasoner, capturing high-level semantics, general knowledge, and long-horizon intent from uniformly sampled frames with a larger temporal stride.

  • Understands what is happening, not only how pixels move.
  • Brings in broader world knowledge learned from large-scale multimodal pretraining.
  • Guides the predictor toward intent, context, and semantics over longer horizons.
Cerebellum-like controller

The JEPA branch keeps the model physically grounded.

The JEPA branch acts as a cerebellum-like controller, modeling low-level dynamics, physical consistency, contact-sensitive motion, and rapid local corrections.

  • Tracks dense local motion that sparse VLM processing can miss.
  • Provides the fine-grained control signal needed for robotic hands and other effectors.
  • Supports downstream embodied tasks, with the paper's current experiments centered on hand-manipulation trajectory prediction.
Method

Dual-Temporal Pathway with Pyramid Guidance

Reading the paper closely, the method hinges on three ideas: keep a dense JEPA pathway for physical forecasting, add a large-stride VLM thinker pathway for semantics, and use hierarchical pyramid extraction so the predictor can benefit from the VLM's progressive reasoning process instead of only a terminal representation.

Architecture Outline

1. Dense JEPA branch: consumes densely sampled observations to preserve local dynamics and interaction cues.
2. VLM thinker branch: processes uniformly sampled frames at a larger temporal stride to capture broader intent and semantic context.
3. Pyramid representation extraction: aggregates intermediate and deep VLM representations into guidance signals compatible with latent prediction.
4. Predictor and task head: injects VLM guidance into JEPA prediction and outputs downstream trajectory forecasts for embodied tasks.
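The dual-temporal sampling in steps 1 and 2 can be sketched as two index streams over the same clip. The strides below are illustrative placeholders, not the paper's actual settings:

```python
import numpy as np

def dual_temporal_indices(num_frames: int, jepa_stride: int = 1, vlm_stride: int = 8):
    """Return frame indices for the two pathways.

    The dense JEPA branch sees (nearly) every frame to preserve local
    dynamics, while the VLM thinker branch sees uniformly sampled frames
    at a larger temporal stride for broader semantic context.
    """
    jepa_idx = np.arange(0, num_frames, jepa_stride)  # dense observation stream
    vlm_idx = np.arange(0, num_frames, vlm_stride)    # sparse reasoning stream
    return jepa_idx, vlm_idx

dense, sparse = dual_temporal_indices(32, jepa_stride=1, vlm_stride=8)
# dense covers all 32 frames; sparse covers frames 0, 8, 16, 24
```

In practice each stream would index into the same video tensor before being fed to its respective encoder.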

What matters technically

  • The VLM is used as guidance, not as a standalone dense predictor.
  • The larger temporal stride gives the reasoning branch a wider temporal receptive field.
  • Multi-layer VLM features are more useful than relying only on a final language-oriented layer.
  • The latent forecasting interface is preserved, which keeps the method useful for downstream control-oriented tasks.
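The multi-layer guidance idea can be illustrated with a minimal sketch: features from several selected VLM layers are combined into a single guidance vector rather than taking only the final layer. The uniform weighting here is a placeholder; a real implementation would use learned projections and weights:

```python
import numpy as np

def pyramid_guidance(layer_feats, weights=None):
    """Aggregate multi-layer VLM features into one guidance vector.

    layer_feats: list of (d,) feature vectors, one per selected VLM
    layer (intermediate and deep). Using several layers exposes the
    VLM's progressive reasoning, not only its final language-oriented
    representation.
    """
    feats = np.stack(layer_feats)  # (L, d)
    if weights is None:
        # simple uniform average over layers (illustrative only)
        weights = np.ones(len(layer_feats)) / len(layer_feats)
    return weights @ feats         # (d,) guidance signal for the predictor
```

The returned vector stands in for the guidance signal that would be injected into the latent predictor alongside the dense JEPA features.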
Evidence

Why the Two-Brain Story is Useful

The paper's core empirical message is not merely that "more features help." It is that semantic reasoning helps most when it guides, rather than replaces, a physically grounded latent predictor.

Model        ADE ↓    FDE ↓    Accuracy ↑
ThinkJEPA    0.0614   0.0556   0.5960
VJEPA_only   0.0706   0.0656   0.4706
VLM_only     0.1420   0.1441   0.0842
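For reference, ADE and FDE are standard trajectory-forecasting metrics and can be computed as follows (a generic definition, not code from the release):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for a trajectory.

    pred, gt: (T, 2) arrays of predicted and ground-truth positions.
    ADE averages the per-step Euclidean error over the horizon;
    FDE takes the error at the final step only.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # (T,) per-step errors
    return dists.mean(), dists[-1]
```

Lower is better for both metrics, which is why the arrows in the table point down for ADE and FDE.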

Interpretation

The VLM-only line underperforms badly on fine-grained prediction, which matches the paper's argument that language-aligned reasoning alone is not an adequate substitute for dense physical modeling.

The JEPA-only line remains strong on local forecasting, but ThinkJEPA improves it by adding broader intent and semantic grounding. That is exactly the motivation behind the cortex / cerebellum story.

Release

Code Just Released

This branch is a project-page snapshot with the released webpage, logo assets, and public code surface. The public ThinkJEPA code is now available; if anything is unclear or inconvenient when setting up, running, or reproducing the released code, please let us know.

Need Help?

This project page presents the paper, the released code, and supporting resources. If you run into any problems, please reach out and let us know.

Citation
@article{zhang2026thinkjepa,
  title={ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model},
  author={Zhang, Haichao and Li, Yijiang and He, Shwai and Nagarajan, Tushar and Chen, Mingfei and Lu, Jianglin and Li, Ang and Fu, Yun},
  journal={arXiv preprint arXiv:2603.22281},
  year={2026}
}