VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

¹ Institute of Artificial Intelligence (TeleAI), China Telecom  ·  ² Lumos Robotics
³ University of Science and Technology of China  ·  ⁴ Northwestern Polytechnical University
⁵ Shanghai Jiao Tong University  ·  ⁶ East China University of Science and Technology
⁷ Harbin Engineering University  ·  ⁸ Fudan University
† Equal contribution · ‡ Project lead · * Corresponding authors

The Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches. First, wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs. Second, human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address these challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including \( \pi_{0.5} \), LingBot-VLA, and WALL-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.

VISTA framework: UMI-VQA perception alignment, physical validation, two-stage co-training, and deployment

Why raw UMI data breaks standard VLA training

01

Visual Grounding Mismatch

Standard VLA and VLM pipelines are built on relatively global, pinhole-perspective supervision. UMI instead records wrist-mounted fisheye views that are local, gripper-centric, and visually far from this pretraining regime.

Video · Main view

To be uploaded · Main view / third-person

VLM visual grounding

  • Pinhole projection
  • Global spatial cues

✅ Matches VLM pretraining

Video · Fisheye

To be uploaded · Wrist fisheye · 155° FOV

Fisheye view in UMI

  • Wrist-mounted gripper-centric (~155° FOV)
  • Severe radial distortion
  • Frequent gripper / arm self-occlusion

❌ Out-of-distribution for standard VLMs

Perception degradation under fisheye observations

Perception degradation under fisheye observations for VLMs. Each VLM entry reports performance on original and fisheye-transformed images on four benchmarks.
Policy degradation under UMI-style wrist-fisheye observations on LIBERO and RoboTwin. For each benchmark, we compare the separately fine-tuned standard-view policy with the wrist-only fisheye policy.
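
For intuition on why wrist-fisheye frames sit so far from pinhole pretraining data, the sketch below compares where the same viewing ray lands under the two projection models. It assumes an idealized equidistant fisheye model (radius proportional to the ray angle), which is a common textbook approximation. The actual UMI lens model and the fisheye transform used for the benchmarks are not specified here, and all function names and parameters are illustrative.

```python
import math

def pinhole_radius(theta: float, f: float) -> float:
    """Image radius of a ray at angle theta (radians) from the optical axis, pinhole model."""
    return f * math.tan(theta)

def equidistant_fisheye_radius(theta: float, f: float) -> float:
    """Image radius of the same ray under the idealized equidistant model r = f * theta."""
    return f * theta

def project_equidistant(x, y, z, f, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (equidistant fisheye)."""
    rho = math.hypot(x, y)
    if rho == 0.0:
        return (cx, cy)  # a point on the optical axis maps to the principal point
    theta = math.atan2(rho, z)  # stays well-defined for rays at or beyond 90 degrees
    r = f * theta
    return (cx + r * x / rho, cy + r * y / rho)

# Compare both models at the edge of a 155-degree field of view (half-angle 77.5 degrees).
half_fov = math.radians(155.0 / 2.0)
edge_pinhole = pinhole_radius(half_fov, f=1.0)
edge_fisheye = equidistant_fisheye_radius(half_fov, f=1.0)
```

Under these assumptions the pinhole radius grows with `tan(theta)` and diverges as the ray approaches 90°, while the fisheye radius grows only linearly. Wide-angle content is therefore strongly compressed toward the image border, which is the radial distortion that pretrained VLMs rarely see.
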
02

Physical Plausibility Mismatch

Raw human demonstrations inherently lack awareness of target robot constraints, such as joint limits, workspace boundaries, and self-collision risks. Training VLA models on this unconstrained data causes the robot to learn physically infeasible actions, inevitably leading to systematic deployment failures.

To illustrate this issue, we analyze the execution of three representative tasks. Unconstrained trajectories often suffer from severe tracking errors. The curves below compare the position and orientation deviations of successful and failed trajectories, showing how these physical limits disrupt the task.

Tracking-deviation curves: success vs. failed trajectories for glue stick handover, drawer pulling, and stapler placement

The VISTA Framework

To address the inherent mismatches of raw UMI data, we propose VISTA, a UMI-oriented VLA framework comprising UMI-VQA for perceptual alignment, a systematic trajectory-level physical validation pipeline for embodiment-aware data curation, and a two-stage co-training recipe with a flow-matching action expert.
Perceptual alignment. We first tackle the visual domain gap by introducing an 8M-sample UMI-VQA dataset, adapting the model to wrist-mounted fisheye observations.
Physical validation. Next, we enforce physical plausibility with a data-completeness pre-check and trajectory scoring for continuity, self-collision risk, and execution fidelity, filtering out unexecutable human motions.
Two-stage co-training. Finally, we employ a two-stage training recipe: an initial VQA-Action co-training phase for representation alignment, followed by a flow-matching action expert for continuous control.

01

Perceptual Alignment

We construct the UMI-VQA dataset to resolve the visual-grounding mismatch inherent in fisheye observations. It contains 8M vision-language samples: 3M pairs built from real-world UMI demonstration frames and a 5M spatial-diversity supplement built from edited images. The samples span five task types:

  • Scene Captioning — Provides concise, holistic descriptions of the scene, visible objects, and gripper-object relations.
  • Scene-State Understanding — Infers the task-relevant manipulation state, including the current gripper status, potential obstacles, and execution constraints.
  • Object Grounding — Localizes task-relevant targets by associating language references with bounding boxes under fisheye distortion.
  • Interaction Grounding — Identifies precise actionable regions where manipulation should occur, such as specific grasp sites, contact regions, and functional parts.
  • Spatial Reasoning — Evaluates geometric and relational understanding, requiring the model to reason about object layout, relative position, depth, and orientation.
UMI-VQA 8M dataset overview: real-world UMI VQA 3M pairs and spatial-diversity supplement 5M
02

Physical Validation

Before training, each trajectory is replayed in a cross-embodiment MuJoCo + Mink pipeline (RealMan, AC one, R1 Pro), scored along three complementary axes plus a data-completeness pre-check, and aggregated into an overall cross-embodiment score \(S(\tau, e)\) via a weighted geometric mean. Trajectories below the validation threshold are filtered out.

  • Trajectory continuity (\(s_{\mathrm{tc}}\)) — embodiment-agnostic gripper smoothness from consecutive waypoint displacement (position & angle); penalizes dropout, tracking loss, or abrupt motion
  • Self-collision risk (\(s_{\mathrm{sr}}\)) — minimum link–link distance during replay on the target embodiment; hard zero below a safety margin, full score above a clearance threshold
  • Execution fidelity (\(s_{\mathrm{ef}}\)) — tracking deviation between demonstrated and replayed end-effector poses, reflecting joint limits, singularities, workspace bounds, and controller bandwidth
Physical validation scoring system: trajectory continuity, self-collision risk, and execution fidelity
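
The aggregation described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the linear ramp between the safety margin and the clearance threshold, the margin and clearance values, the equal weights, and the filtering threshold are all assumptions chosen for the example.

```python
import math

def clearance_score(min_dist, margin=0.01, clearance=0.05):
    """Self-collision score from the minimum link-link distance (metres).

    Hard zero at or below the safety margin, full score at or above the
    clearance threshold. The linear ramp in between is an assumption.
    """
    if min_dist <= margin:
        return 0.0
    if min_dist >= clearance:
        return 1.0
    return (min_dist - margin) / (clearance - margin)

def weighted_geometric_mean(scores, weights):
    """Weighted geometric mean; any zero score forces the result to zero."""
    if any(s <= 0.0 for s in scores):
        return 0.0
    total = sum(weights)
    return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)) / total)

def overall_score(s_tc, s_sr, s_ef, weights=(1.0, 1.0, 1.0)):
    """Combine continuity, self-collision, and fidelity scores (each in [0, 1])."""
    return weighted_geometric_mean((s_tc, s_sr, s_ef), weights)

def passes_validation(score, threshold=0.6):
    """Keep a trajectory only if its score reaches the (hypothetical) threshold."""
    return score >= threshold
```

A geometric mean is a natural fit for this kind of gating: unlike an arithmetic mean, a single zero component (for example, a self-collision below the safety margin) drives the overall score to zero regardless of how well the other axes score.
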
03

Model Training

Pre-training

VISTA is pre-trained on large-scale perception-aligned corpora comprising 8M UMI-VQA samples and 100K real-world robot trajectories. Pre-training proceeds in two stages: autoregressive co-training on VQA and discretized actions to align the VLM backbone with the fisheye observation regime, followed by continuous-action refinement via a knowledge-isolated flow-matching expert.

Stage 1
VQA-Action Autoregressive Co-training

We first co-train the VLA backbone on action prediction and VQA answering under a unified autoregressive objective.

  • Action learning. Each continuous action chunk \(a_{t:t+H-1}\) is converted into a sequence of discrete FAST tokens \(z_{1:N_a}\). Given visual observations \(o_t\), language instruction \(l\), and robot state \(s_t\), the target output is the action-token sequence \(z_{1:N_a}\).
  • VQA supervision. Given an image observation \(o\), a language question \(q\), and the target answer sequence \(u_{1:N_q}\), the target output is the answer-token sequence \(u_{1:N_q}\).
  • Objective. Action tokens and answer tokens are optimized with the same autoregressive next-token prediction loss.
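
Written out with the symbols above, the unified objective can be sketched as a sum of two next-token negative log-likelihood terms. The parameter symbol \(\theta\) and the equal weighting of the two terms are assumptions made for illustration; the relative weighting or sampling ratio between action and VQA batches is not specified here.

\[
\mathcal{L}_{\mathrm{AR}}(\theta) =
-\,\mathbb{E}_{(o_t,\, l,\, s_t,\, z)}\left[\sum_{i=1}^{N_a} \log p_\theta\left(z_i \mid z_{<i},\, o_t,\, l,\, s_t\right)\right]
-\,\mathbb{E}_{(o,\, q,\, u)}\left[\sum_{j=1}^{N_q} \log p_\theta\left(u_j \mid u_{<j},\, o,\, q\right)\right]
\]
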
Stage 2
Knowledge-Isolated Flow Matching Action Expert

To prevent catastrophic forgetting of the perception and discrete-action knowledge acquired in Stage 1, we follow the knowledge-isolation strategy, keeping the pretrained VLA backbone frozen and training a separate continuous action expert on top.
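
As a rough illustration of what the continuous action expert optimizes, the sketch below implements a generic flow-matching loss and an Euler sampler in plain Python. It assumes a straight-line path from Gaussian noise (at t = 0) to the action (at t = 1); other works use the reverse time convention, and the convention used by VISTA is not stated here. `predict_velocity` is a hypothetical stand-in for the expert, which in the real model would also be conditioned on features from the frozen backbone.

```python
import random

def flow_matching_pair(action, noise, t):
    """Point x_t on the straight path noise -> action and its target velocity."""
    x_t = [(1.0 - t) * n + t * a for a, n in zip(action, noise)]
    velocity = [a - n for a, n in zip(action, noise)]
    return x_t, velocity

def flow_matching_loss(predict_velocity, action, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t, target = flow_matching_pair(action, noise, t)
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(action)

def integrate(predict_velocity, noise, steps=10):
    """Generate an action by Euler-integrating the velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because only `predict_velocity` is trained in this stage, gradients from the flow-matching loss never reach the backbone, which is what protects the Stage 1 perception and discrete-action knowledge.
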

Downstream Task Fine-tuning

After pre-training, VISTA is adapted to downstream tasks and target embodiments. We first apply a strict physical validation threshold to filter downstream task data for the specific deployment robot, removing trajectories with kinematic violations, self-collision risks, or poor replay fidelity.

During fine-tuning, we unfreeze the full model and update both the VLA backbone and the flow-matching action expert end-to-end, preserving generalist visual-linguistic knowledge while specializing perception and continuous action generation to the target robot’s dynamics and task distribution.

Multi-Robot Deployment System

To fully exploit the cross-embodiment potential of VISTA, we implement a pure Python distributed deployment architecture for heterogeneous robotic arms. The system uses Zenoh as the communication middleware, enabling transparent shared-memory or network-level transmission across local processes, LANs, or WANs. State streams from multiple arms are aggregated to distributed GPU compute nodes for batched inference; predicted action chunks are temporally ensembled and routed back to the respective robots via synchronous RPC calls to ensure strict temporal alignment and prevent command accumulation. This design eliminates heavy ROS dependencies and allows seamless integration of new robot arms that satisfy the UMI end-effector mounting specification.

Multi-robot deployment system: Zenoh middleware connecting host and satellite GPU nodes to heterogeneous robot arms
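
The temporal ensembling step can be sketched as below. The weighting scheme is not specified here, so the example assumes the exponential scheme popularized by ACT, where the oldest prediction for a timestep gets the largest weight. The class name, the `decay` parameter, and the step-indexed buffer are illustrative, and the Zenoh transport and RPC layers are deliberately left out.

```python
import math
from collections import defaultdict

class TemporalEnsembler:
    """Blend overlapping action-chunk predictions for the same control step."""

    def __init__(self, decay=0.1):
        self.decay = decay
        # control step -> predictions for that step, oldest chunk first
        self.pending = defaultdict(list)

    def add_chunk(self, start_step, chunk):
        """Register a predicted chunk whose first action targets start_step."""
        for offset, action in enumerate(chunk):
            self.pending[start_step + offset].append(action)

    def pop_action(self, step):
        """Return the weighted average for this step and drop its buffer.

        Raises KeyError if no chunk has covered the step yet.
        """
        preds = self.pending.pop(step)
        weights = [math.exp(-self.decay * i) for i in range(len(preds))]
        total = sum(weights)
        dim = len(preds[0])
        return [
            sum(w * p[d] for w, p in zip(weights, preds)) / total
            for d in range(dim)
        ]

# Usage: two overlapping chunks for one arm, each action a 1-D command.
ens = TemporalEnsembler(decay=0.0)  # decay=0 reduces to a plain average
ens.add_chunk(0, [[0.0], [2.0]])
ens.add_chunk(1, [[4.0]])
first = ens.pop_action(0)
second = ens.pop_action(1)
```

Popping each step's buffer as soon as the command is sent keeps memory bounded and ensures one command per control step, in line with the goal of preventing command accumulation.
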

Experimental Results

Diagnostic Validation Experiments

Diagnostic validation aims to verify the two UMI-to-VLA bottlenecks: perception mismatch and physical infeasibility. We evaluate these challenges by measuring policy degradation under wrist-fisheye views on the LIBERO and RoboTwin benchmarks, assessing the impaired visual reasoning of VLMs across four spatial benchmarks (Where2Place, RefSpatial, ERQA, EmbSpatial), and demonstrating the physical unexecutability of raw human trajectories via real-robot replay on the RealMan embodiment.

Fisheye observations degrade policy learning

We evaluate the impact of UMI-style observations on policy learning. Table 1 shows consistent performance degradation under wrist-fisheye training across multiple VLA models (\( \pi_{0.5} \), LingBot-VLA, WALL-X) and benchmarks (LIBERO, RoboTwin), confirming that the perception bottleneck is genuine.

Table 1: Policy degradation under wrist-fisheye observations across models and benchmarks

Fisheye observations degrade robot-relevant visual reasoning

To quantify visual reasoning degradation, we evaluate five general and robot-specialized VLMs across four spatial benchmarks. As shown in Table 2, fisheye-transformed images cause a consistent performance drop across all models (averaging an 8.6% relative degradation) compared to standard perspectives. This confirms that pretrained VLMs lose critical spatial and object understanding under distorted wrist-fisheye views, necessitating our perception-aligned VQA.

Table 2: Visual reasoning degradation under fisheye-transformed images across VLMs and spatial benchmarks
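
For readers who want to reproduce the relative-degradation figure from per-benchmark scores, a minimal helper is sketched below. How the reported average is aggregated (per model, per benchmark, or both) is not stated here, so the simple mean over score pairs is an assumption, and the numbers in the usage lines are hypothetical rather than taken from Table 2.

```python
def relative_degradation(original, fisheye):
    """Relative drop, in percent, from the original-image score to the fisheye score."""
    return 100.0 * (original - fisheye) / original

def mean_relative_degradation(pairs):
    """Unweighted mean of relative drops over (original, fisheye) score pairs."""
    drops = [relative_degradation(o, f) for o, f in pairs]
    return sum(drops) / len(drops)

# Hypothetical scores for illustration only.
example_pairs = [(50.0, 45.0), (40.0, 38.0)]
example_mean = mean_relative_degradation(example_pairs)
```
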

Raw UMI trajectories are not always executable

We audit the physical executability of raw UMI trajectories on a RealMan robot across three representative tasks. As illustrated in Figure 6, both simulation and real-world replays reveal significant deviations between human-demonstrated trajectories and feasible robot poses due to inherent physical constraints. These execution deviations directly cause task failures. This empirically confirms that human-collected UMI data cannot serve as executable supervision without physical validation; otherwise, VLA models learn physically infeasible actions.

Table 3: Visual reasoning degradation under fisheye-transformed images across VLMs and spatial benchmarks

Citation

@article{yang2026vista,
  title   = {VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training}, 
  author  = {Siyuan Yang and Linzheng Guo and Ouyang Lu and Zhaxizhuoma and Daoran Zhang and Xinmiao Wang and Ting Xiao and Fangzheng Yan and Zhijun Chen and Yan Ding and Chao Yu and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2606.04708},
  year    = {2026},
}