SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

Siyi Chen1,2,*, Mikaela Angelina Uy2, Chan Hee Song3, Faisal Ladhak2, Adithyavairavan Murali2, Qing Qu1, Stan Birchfield2, Valts Blukis2,‡, Jonathan Tremblay2,‡
1University of Michigan  2NVIDIA  3Ohio State University
‡Project Leads  *Work done during an internship at NVIDIA

TL;DR: SpaceTools equips VLMs with vision and robotic tools for spatial reasoning via Double Interactive Reinforcement Learning (DIRL), enabled by our Toolshed infrastructure. It achieves state-of-the-art performance on spatial reasoning benchmarks and enables precise real-world robot manipulation.

Abstract

Vision Language Models (VLMs) demonstrate strong qualitative visual understanding but struggle with the metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools to augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet how to realize this vision remains an open challenge without relying solely on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit a VLM's ability to discover optimal tool-use patterns. Reinforcement Learning could close this gap, but it has so far been limited to reasoning with a single visual tool because of the large search space of multi-tool reasoning.

We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, equipped with tool-augmented spatial reasoning, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ask) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL yields substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines.

SpaceTools Teaser

SpaceTools learns to use multiple computer vision tools to solve complex problems.

Spatial Reasoning Visualization

SpaceTools Reasoning Visualization

Spatial reasoning visualization of SpaceTools. The model performs diverse spatial reasoning tasks, including relative depth, pose, grasping, spatial compatibility, and spatial relationships, by interleaving reasoning (gray) with vision tool calls (green) before producing the final answer.
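
To make the interleaving concrete, the sketch below shows one way such a reasoning and tool-calling loop can be wired up at inference time. The tool names, the <tool>...</tool> tag format, and the generate_fn interface are illustrative assumptions, not the exact SpaceTools protocol.

# Minimal sketch of an interleaved reasoning / tool-call loop (inference time).
# Tool names, the <tool>...</tool> tag format, and generate_fn are illustrative
# assumptions, not the exact SpaceTools interface.
import re

TOOLS = {
    "segment": lambda args: {"mask": f"<mask for {args}>"},          # e.g. a segmentation model
    "estimate_depth": lambda args: {"depth": "<metric depth map>"},  # e.g. a depth estimator
}

def run_episode(generate_fn, image, question, max_turns=8):
    """Alternate model reasoning with tool execution until a final answer."""
    context = [{"role": "user", "image": image, "text": question}]
    reply = ""
    for _ in range(max_turns):
        reply = generate_fn(context)  # VLM produces reasoning plus an optional tool call
        call = re.search(r"<tool>(\w+)\((.*?)\)</tool>", reply)
        if call is None:              # no tool call means the reply is the final answer
            break
        name, args = call.group(1), call.group(2)
        result = TOOLS[name](args)    # run the vision tool and feed its output back
        context.append({"role": "assistant", "text": reply})
        context.append({"role": "tool", "name": name, "result": result})
    return reply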

Quantitative Results

| Model | RoboSpatial VQA | RoboSpatial Vacant | RoboSpatial Overall | BLINK Depth | RefSpatial | CVBench 2D Rel. | CVBench 3D Depth | BOP-ask Pose | BOP-ask Grasp-MACE | BOP-ask Grasp-SR |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | | |
| Claude Sonnet 4.5 | 75.44 | 23.77 | 57.43 | 78.23 | 7.49 | 89.85 | 78.50 | 1.67 | 40.12 | 48.33 |
| GPT-4o | 61.61 | 25.10 | 48.88 | 63.71 | 8.48 | 88.77 | 75.50 | 0.00 | 5.50 | 1.67 |
| GPT-5 | 76.50 | 22.17 | 58.39 | 66.13 | 23.10 | 95.54 | 91.33 | 9.03 | 39.59 | 41.67 |
| Gemini-ER 1.5 | 79.30 | 31.10 | 62.50 | 69.23 | 41.72 | 95.54 | 90.50 | 0.00 | 30.06 | 23.33 |
| General Open-Source Models | | | | | | | | | | |
| LLaVA-NeXT-8B | 69.31 | 0.00 | 45.15 | 53.23 | 0.78 | 72.15 | 73.67 | 0.00 | 5.04 | 1.67 |
| Qwen2.5-VL-32B | 61.84 | 3.28 | 41.43 | 70.16 | 7.28 | 90.46 | 86.67 | 0.00 | 29.86 | 23.33 |
| Qwen2.5-VL-3B | 53.07 | 0.00 | 35.71 | 70.98 | 0.00 | 70.62 | 65.33 | 0.00 | 6.06 | 0.00 |
| Spatial VLMs | | | | | | | | | | |
| SpaceLLaVA-13B | 61.00 | 2.50 | 40.61 | 51.61 | 3.25 | 61.08 | 62.83 | 0.00 | 0.00 | 0.00 |
| RoboPoint-13B | 70.18 | 19.70 | 52.58 | 54.84 | 15.59 | 74.00 | 76.50 | 0.00 | 0.00 | 0.00 |
| Molmo-7B | 39.92 | 0.82 | 26.29 | 54.03 | 0.00 | 72.15 | 73.33 | 0.00 | 36.74 | 18.33 |
| RoboBrain2.0-7B | 59.64 | 44.35 | 54.31 | 84.68 | 32.50 | 87.23 | 90.00 | 0.00 | 0.00 | 0.00 |
| RoboRefer-8B-SFT | 58.33 | **61.48** | 59.43 | 88.71 | 48.37 | **96.31** | **96.50** | 0.00 | 0.00 | 0.00 |
| Tool-free Fine-tuning | | | | | | | | | | |
| Qwen2.5-VL-3B (Tool-free SFT) | 66.66 | 41.80 | 58.00 | 80.65 | 20.22 | 91.54 | 83.33 | 2.44 | 39.47 | 35.00 |
| Qwen2.5-VL-3B (Tool-free RL) | 67.54 | 28.69 | 54.00 | 80.65 | 23.10 | 87.38 | 70.83 | 12.00 | 38.79 | 36.67 |
| SpaceTools-3B (Ours) | **79.38** | 52.46 | **70.00** | **90.32** | **53.07** | 94.92 | 96.00 | **34.37** | **43.06** | **50.00** |

Performance comparison across spatial reasoning benchmarks. All values are normalized accuracy (%); bold marks the best result in each column.

Real-World Robot Execution

SpaceTools robot execution videos: Demo 1, Demo 2, Demo 3.

SpaceTools demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. The model completes multi-step tasks by alternating between reasoning (gray), vision tools (green) for perception, and robot tools (blue) for action.
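
To illustrate what one of these multi-step episodes might look like as data, here is a hypothetical sequence of tool calls for a pick-and-place instruction; the tool names and arguments are placeholders inspired by the tools listed in the Toolshed section below, not the exact SpaceTools robot API.

# Hypothetical tool-call sequence for "put the red mug on the shelf".
# All tool names and arguments are illustrative placeholders.
episode = [
    ("capture_image",  {}),                        # robot tool: fresh camera view
    ("segment",        {"query": "red mug"}),      # vision tool: locate the target object
    ("predict_grasp",  {"object": "red mug"}),     # vision tool: propose a grasp
    ("execute_grasp",  {"grasp_id": 0}),           # robot tool: execute the chosen grasp
    ("capture_image",  {}),                        # re-observe the scene after grasping
    ("segment",        {"query": "shelf"}),        # vision tool: locate the placement region
    ("place",          {"target": "shelf"}),       # robot tool: place the object
]
for name, args in episode:
    print(f"call {name} with {args}")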

Approach

The Key Steps of DIRL

Double Interactive Reinforcement Learning (DIRL) is a two-phase training framework designed to teach VLMs reliable and scalable multi-tool coordination. It proceeds in two phases:

  • Teaching Phase: We construct a curated training set by combining demonstrations from (a) a single-tool specialist trained with interactive RL and (b) high-quality multi-tool traces generated by a frontier model. This phase gives the model a strong initialization and clear examples of grounded tool usage.
  • Exploration Phase: Starting from this initialization, the model undergoes full multi-tool interactive RL. Through exploration and feedback, it learns to sequence tools effectively and to refine its coordination strategies.

Interactive RL, the collection of real tool-use traces from frontier models, and inference-time tool deployment are all made possible by our scalable Toolshed infrastructure.
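
Putting the two phases together, a minimal outline of the training recipe might look like the sketch below; every function argument is a hypothetical stand-in for the real component (SFT trainer, Toolshed rollouts, reward computation, GRPO update), not the actual training code.

# High-level outline of the two DIRL phases. All *_fn arguments are
# caller-supplied placeholders; this is a sketch, not the real trainer.

def dirl(base_model, specialist_traces, frontier_traces,
         sft_fn, rollout_fn, reward_fn, grpo_update_fn, rl_steps=1000):
    # Teaching phase: supervised fine-tuning on a curated mix of
    # (a) single-tool traces from a specialist trained with interactive RL and
    # (b) multi-tool traces collected from a frontier model.
    model = sft_fn(base_model, specialist_traces + frontier_traces)

    # Exploration phase: interactive multi-tool RL. The model rolls out
    # trajectories against live tools and is updated from task rewards.
    for _ in range(rl_steps):
        rollouts = rollout_fn(model)                 # tool-augmented trajectories
        rewards = [reward_fn(r) for r in rollouts]   # task-level feedback
        model = grpo_update_fn(model, rollouts, rewards)
    return model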

Toolshed and Interactive RL Training Visualization


Left: Toolshed is an infrastructure for deploying heavy tools during training and inference. It enables efficient tool execution and management for the DIRL framework.
Right: Visualization of the interactive reinforcement learning training process (via GRPO), showing how the model learns to coordinate multiple tools through exploration and feedback.
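
For reference, the group-relative advantage that GRPO-style training relies on can be sketched in a few lines. The binary correctness reward in the example is an assumption for illustration, not the exact reward used to train SpaceTools.

# Minimal sketch of the group-relative advantage at the heart of GRPO-style
# training: several rollouts of the same question are scored, and each rollout's
# advantage is its reward normalized by the group mean and standard deviation.
from statistics import mean, pstdev

def group_relative_advantages(group_rewards, eps=1e-6):
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example: four rollouts of one spatial question, rewarded 1.0 when the final
# answer is correct (the binary reward here is an illustrative assumption).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [1.0, -1.0, 1.0, -1.0]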

Core Features of Toolshed

Toolshed is a scalable, distributed, asynchronous framework for deploying compute-heavy vision and robotic tools alongside VLM training and inference. It mitigates bottlenecks through resource and environment isolation, decoupled tool execution, and asynchronous parallel workers that scale independently from model compute. Toolshed hosts modular vision tools (e.g., segmentation, pointing, depth, 3D box fitting, grasp prediction) and robotic tools (e.g., image capture, grasp execution, placement). It has the following core features (a toy sketch of the worker pattern follows the list):

  • Decoupled execution. Tool calls run outside the policy loop, preventing blocking.
  • Asynchronous workers. Parallel tool instances handle requests independently for high throughput.
  • Resource isolation. Each tool receives dedicated GPU/CPU resources.
  • Environment isolation. Tools run in separate Python environments to avoid dependency conflicts.
  • Elastic scaling. Additional workers can be spawned to handle usage spikes.
  • Multimodal data passing. Efficient transfer of text, images, and structured outputs (e.g., point clouds) across devices/nodes.
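
The toy sketch below illustrates the decoupled, asynchronous worker pattern with a plain asyncio queue. It is an illustration of the idea only; the worker names and payloads are placeholders, and the actual Toolshed implementation adds resource isolation, separate environments, and cross-node data passing on top.

# Toy sketch of decoupled, asynchronous tool workers in the spirit of Toolshed.
import asyncio

async def tool_worker(name, queue):
    # Each worker owns one tool instance (in practice: its own GPU/CPU share
    # and Python environment) and serves requests independently of the others.
    while True:
        request, reply = await queue.get()
        reply.set_result({"worker": name, "output": f"processed {request}"})
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    # Several parallel workers for the same tool; elastic scaling would simply
    # spawn more of these when usage spikes.
    workers = [asyncio.create_task(tool_worker(f"depth-{i}", queue)) for i in range(3)]

    # Policy rollouts submit tool calls without blocking each other.
    loop = asyncio.get_running_loop()
    futures = []
    for request in ["img_0", "img_1", "img_2", "img_3"]:
        fut = loop.create_future()
        await queue.put((request, fut))
        futures.append(fut)

    print(await asyncio.gather(*futures))  # collect all tool results
    for w in workers:
        w.cancel()

asyncio.run(main())

Because tool execution happens in these workers rather than inside the policy loop, a slow tool call delays only its own request instead of stalling the whole rollout batch.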

Citation

@misc{chen2025spacetoolstoolaugmentedspatialreasoning,
    title={SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL}, 
    author={Siyi Chen and Mikaela Angelina Uy and Chan Hee Song and Faisal Ladhak and Adithyavairavan Murali and Qing Qu and Stan Birchfield and Valts Blukis and Jonathan Tremblay},
    year={2025},
    eprint={2512.04069},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.04069}
}