Robotic VLA Benefits from Joint Learning
with Motion Image Diffusion

Yu Fang1,2,* Kanchana Ranasinghe1 Le Xue1 Honglu Zhou1 Juntao Tan1 Ran Xu1 Shelby Heinecke1 Caiming Xiong1 Silvio Savarese1 Daniel Szafir2 Mingyu Ding2 Michael S. Ryoo1 Juan Carlos Niebles1
1 Salesforce AI Research
2 University of North Carolina at Chapel Hill
* Work done during an internship at Salesforce

TL;DR: We propose a joint learning strategy with motion image diffusion that equips VLA models with motion reasoning capabilities by extending the VLA into a dual-head architecture: a DiT-based motion head for optical-flow prediction alongside the standard action head.

Motivation

Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation by mapping multimodal observations and instructions directly to actions. However, they typically mimic expert trajectories without predictive motion reasoning, limiting their ability to anticipate future dynamics when deciding what actions to take.

To address this limitation, we propose joint learning with motion image diffusion, a novel strategy that enhances VLA models with motion reasoning capabilities. Our method extends the VLA architecture with a dual-head design: while the action head predicts action chunks as in vanilla VLAs, an additional motion head, implemented as a Diffusion Transformer (DiT), predicts optical-flow-based motion images that capture future scene dynamics. The two heads are trained jointly, enabling the shared VLM backbone to learn representations that couple robot control with motion knowledge. This joint learning builds temporally coherent and physically grounded representations without modifying the inference pathway of standard VLAs, leaving test-time latency unchanged.
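To make the dual-head design concrete, below is a minimal PyTorch-style sketch of the architecture described above. The class names (DualHeadVLA, ActionHead, MotionDiTHead), layer sizes, and the use of a generic transformer decoder in place of a full DiT are illustrative assumptions, not the paper's implementation; at inference only the action head is queried, so the motion head adds no test-time latency.

import torch
import torch.nn as nn


class ActionHead(nn.Module):
    # Predicts an action chunk (horizon x action_dim) from pooled backbone features.
    def __init__(self, hidden_dim, action_dim, horizon):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, horizon * action_dim),
        )

    def forward(self, feats):
        pooled = feats.mean(dim=1)                              # (B, D)
        return self.mlp(pooled).view(-1, self.horizon, self.action_dim)


class MotionDiTHead(nn.Module):
    # DiT-style denoiser over tokenized optical-flow motion images,
    # conditioned on the shared VLM features and a diffusion timestep.
    def __init__(self, hidden_dim, flow_dim, depth=4, heads=8):
        super().__init__()
        self.flow_proj = nn.Linear(flow_dim, hidden_dim)
        self.time_emb = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim))
        layer = nn.TransformerDecoderLayer(hidden_dim, heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=depth)
        self.out = nn.Linear(hidden_dim, flow_dim)

    def forward(self, noisy_flow_tokens, t, vlm_feats):
        x = self.flow_proj(noisy_flow_tokens)
        x = x + self.time_emb(t.float()[:, None])[:, None, :]   # broadcast timestep embedding
        x = self.blocks(tgt=x, memory=vlm_feats)                # cross-attend to VLM features
        return self.out(x)                                      # predicted noise per flow token


class DualHeadVLA(nn.Module):
    # Shared VLM backbone feeding an action head (always used) and a
    # diffusion motion head (training only, so inference latency is unchanged).
    def __init__(self, backbone, hidden_dim=512, action_dim=7, horizon=16,
                 flow_dim=2 * 16 * 16):
        super().__init__()
        self.backbone = backbone                                # shared VLM backbone
        self.action_head = ActionHead(hidden_dim, action_dim, horizon)
        self.motion_head = MotionDiTHead(hidden_dim, flow_dim)

    def forward(self, obs_tokens, noisy_flow_tokens=None, t=None):
        feats = self.backbone(obs_tokens)                       # (B, N, D) shared features
        actions = self.action_head(feats)
        flow_noise_pred = None
        if noisy_flow_tokens is not None:                       # motion head used only in training
            flow_noise_pred = self.motion_head(noisy_flow_tokens, t, feats)
        return actions, flow_noise_pred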

Experiments in both simulation and real-world environments demonstrate the effectiveness of joint learning with motion image diffusion: it raises the success rate of pi-series VLAs to 97.5% on the LIBERO benchmark and 58.0% on the RoboTwin benchmark, and yields a 23% improvement in real-world performance, validating its ability to enhance the motion reasoning capability of large-scale VLAs.

Method

We propose joint learning with motion image diffusion, a simple yet effective strategy that seamlessly improves VLAs with motion reasoning ability. Specifically, we extend the VLA architecture with a dual-head design: an action head that predicts action chunks as in vanilla VLAs, and a motion head, implemented as a Diffusion Transformer (DiT), that predicts optical-flow-based future motion images through diffusion. Both heads share the same VLM backbone and are optimized jointly, enabling the model to learn temporally coherent and physically grounded representations that support both fine-grained control and motion understanding.

We find that optical-flow-based motion images offer an efficient and control-aligned supervision signal for joint learning. Unlike future-image prediction or language-based motion descriptions, optical flow directly encodes how the scene moves, making it inherently consistent with action learning. This complementary supervision encourages the model to align physical motion dynamics with robot control, providing dense temporal guidance that improves visuomotor policy learning. Importantly, our strategy integrates seamlessly into existing large-scale VLA models with no additional inference latency, making it practical for real-world robotic deployment.
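The sketch below illustrates one joint training step under the assumptions of the architecture sketch above: a behavior-cloning loss on the action head plus a DDPM-style denoising loss on the motion head's optical-flow motion images. The cosine noise schedule, loss weight, and batch layout are illustrative assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F


def joint_training_step(model, batch, optimizer,
                        num_diffusion_steps=1000, motion_loss_weight=0.1):
    # batch["obs_tokens"]:  (B, N, D) tokenized observations and instruction
    # batch["actions"]:     (B, H, A) expert action chunk
    # batch["flow_tokens"]: (B, T, F) optical-flow motion images computed offline
    #                       from consecutive frames and tokenized into patches
    obs = batch["obs_tokens"]
    expert_actions = batch["actions"]
    flow = batch["flow_tokens"]

    # Sample a diffusion timestep and corrupt the clean motion-image tokens
    # (cosine noise schedule shown purely as an illustrative choice).
    t = torch.randint(0, num_diffusion_steps, (flow.size(0),), device=flow.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_diffusion_steps) ** 2
    noise = torch.randn_like(flow)
    noisy_flow = (alpha_bar.sqrt()[:, None, None] * flow
                  + (1.0 - alpha_bar).sqrt()[:, None, None] * noise)

    # One forward pass through the shared backbone feeds both heads.
    pred_actions, pred_noise = model(obs, noisy_flow_tokens=noisy_flow, t=t)

    action_loss = F.mse_loss(pred_actions, expert_actions)     # behavior cloning on action chunks
    motion_loss = F.mse_loss(pred_noise, noise)                # diffusion denoising on motion images
    loss = action_loss + motion_loss_weight * motion_loss      # joint objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return {"action_loss": action_loss.item(), "motion_loss": motion_loss.item()}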

Results

Pick up the black bowl from table center and place it on the plate

Pick up the alphabet soup and place it in the basket

Open the middle drawer of the cabinet

Put the white mug on the left plate and put the yellow and white mug on the right plate

Turn on the stove and put the moka pot on it

Put the black bowl in the bottom drawer of the cabinet and close it

Grab the hammer and beat the block

Grab the shoe from the table and place it on the mat

Pick up the can and move it to beside the pot

Pick up the scanner and the object, then use the scanner to scan the object

Move the blocks to the center of the table and stack the green block on the red block

Citation

@article{fang2025robotic,
  title={Robotic VLA Benefits from Joint Learning with Motion Image Diffusion},
  author={Fang, Yu and Ranasinghe, Kanchana and Xue, Le and Zhou, Honglu and Tan, Juntao and Xu, Ran and Heinecke, Shelby and Xiong, Caiming and Savarese, Silvio and Szafir, Daniel and Ding, Mingyu and Ryoo, Michael S. and Niebles, Juan Carlos},
  journal={arXiv preprint arXiv:2512.18007},
  year={2025}
}