TL;DR: SpaceTools empowers VLMs with vision and robotic tools for spatial reasoning via Double Interactive Reinforcement Learning (DIRL), enabled by our Toolshed infrastructure. Achieves state-of-the-art performance on spatial reasoning benchmarks and enables precise real-world robot manipulation.
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with the metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can draw on a wide variety of tools to augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet realizing this vision remains an open challenge: existing approaches rely on handcrafted prompting strategies or fixed, predefined tool pipelines, which limit the VLM's ability to discover optimal tool-use patterns. Reinforcement Learning could close this gap, but so far it has been limited to reasoning with a single visual tool because the search space of multi-tool reasoning is large.
We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, equipped with tool-augmented spatial reasoning, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ask) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL yields substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines.
SpaceTools learns to use multiple computer vision tools to solve complex problems.
Spatial reasoning visualization of SpaceTools. It performs diverse spatial reasoning tasks including relative depth, pose, grasp, spatial compatibility, and spatial relationship by interleaving reasoning (gray) and vision tool calls (green) before producing the final answer.
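As a concrete illustration of such an interleaved trajectory, the sketch below shows what a single relative-depth query might look like as a message sequence. The tool names, arguments, and schema are illustrative assumptions, not SpaceTools' actual output format.

```python
# Hypothetical interleaved trajectory for a relative-depth question.
# Tool names ("point", "estimate_depth") and the message schema are illustrative only.
trace = [
    {"role": "user", "content": "Which is closer to the camera, the mug or the laptop?"},
    {"role": "assistant",                                                      # reasoning step (gray)
     "content": "I should localize both objects, then compare their metric depth.",
     "tool_call": {"name": "point", "args": {"queries": ["mug", "laptop"]}}},  # vision tool call (green)
    {"role": "tool", "content": {"mug": [412, 305], "laptop": [188, 240]}},
    {"role": "assistant",
     "content": "Now query depth at the two predicted points.",
     "tool_call": {"name": "estimate_depth", "args": {"points": [[412, 305], [188, 240]]}}},
    {"role": "tool", "content": {"depth_m": [0.48, 0.92]}},
    {"role": "assistant",
     "content": "The mug is at 0.48 m and the laptop at 0.92 m, so the mug is closer.",
     "answer": "mug"},                                                         # final answer
]
```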
| Model | RoboSpatial VQA | RoboSpatial Vacant | RoboSpatial Overall | BLINK Depth | RefSpatial | CVBench 2D Rel. | CVBench 3D Depth | BOP-ask Pose | BOP-ask Grasp-MACE | BOP-ask Grasp-SR |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | ||||||||||
| Claude Sonnet 4.5 | 75.44 | 23.77 | 57.43 | 78.23 | 7.49 | 89.85 | 78.50 | 1.67 | 40.12 | 48.33 |
| GPT-4o | 61.61 | 25.10 | 48.88 | 63.71 | 8.48 | 88.77 | 75.50 | 0.00 | 5.50 | 1.67 |
| GPT-5 | 76.50 | 22.17 | 58.39 | 66.13 | 23.10 | 95.54 | 91.33 | 9.03 | 39.59 | 41.67 |
| Gemini-ER 1.5 | 79.30 | 31.10 | 62.50 | 69.23 | 41.72 | 95.54 | 90.50 | 0.00 | 30.06 | 23.33 |
| General Open-Source Models | ||||||||||
| LLaVA-NeXT-8B | 69.31 | 0.00 | 45.15 | 53.23 | 0.78 | 72.15 | 73.67 | 0.00 | 5.04 | 1.67 |
| Qwen2.5-VL-32B | 61.84 | 3.28 | 41.43 | 70.16 | 7.28 | 90.46 | 86.67 | 0.00 | 29.86 | 23.33 |
| Qwen2.5-VL-3B | 53.07 | 0.00 | 35.71 | 70.98 | 0.00 | 70.62 | 65.33 | 0.00 | 6.06 | 0.00 |
| Spatial VLMs | ||||||||||
| SpaceLLaVA-13B | 61.00 | 2.50 | 40.61 | 51.61 | 3.25 | 61.08 | 62.83 | 0.00 | 0.00 | 0.00 |
| RoboPoint-13B | 70.18 | 19.70 | 52.58 | 54.84 | 15.59 | 74.00 | 76.50 | 0.00 | 0.00 | 0.00 |
| Molmo-7B | 39.92 | 0.82 | 26.29 | 54.03 | 0.00 | 72.15 | 73.33 | 0.00 | 36.74 | 18.33 |
| RoboBrain2.0-7B | 59.64 | 44.35 | 54.31 | 84.68 | 32.50 | 87.23 | 90.00 | 0.00 | 0.00 | 0.00 |
| RoboRefer-8B-SFT | 58.33 | 61.48 | 59.43 | 88.71 | 48.37 | 96.31 | 96.50 | 0.00 | 0.00 | 0.00 |
| Tool-free Fine-tuning | ||||||||||
| Qwen2.5-VL-3B-Tool-free SFT | 66.66 | 41.80 | 58.00 | 80.65 | 20.22 | 91.54 | 83.33 | 2.44 | 39.47 | 35.00 |
| Qwen2.5-VL-3B-Tool-free RL | 67.54 | 28.69 | 54.00 | 80.65 | 23.10 | 87.38 | 70.83 | 12.00 | 38.79 | 36.67 |
| SpaceTools-3B (Ours) | 79.38 | 52.46 | 70.00 | 90.32 | 53.07 | 94.92 | 96.00 | 34.37 | 43.06 | 50.00 |
Performance comparison across spatial reasoning benchmarks. All values are normalized accuracy (%). Bold indicates the best performance within each column, and underline denotes the second-best result.
Demo 1
Demo 2
Demo 3
SpaceTools demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. The model completes multi-step tasks by alternating between reasoning (gray), vision tools (green) for perception, and robot tools (blue) for action.
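To make this alternation concrete, here is a minimal sketch of such an agentic loop. The helper functions (`call_vlm`, `run_tool`) and the specific tool names are assumptions for illustration, not the actual SpaceTools or Toolshed API.

```python
# Minimal sketch of the reasoning / vision-tool / robot-tool loop.
# All names below are illustrative; the real tool interface is provided by Toolshed.
from typing import Any

VISION_TOOLS = {"segment", "point", "estimate_depth", "fit_3d_box", "predict_grasp"}
ROBOT_TOOLS = {"capture_image", "execute_grasp", "place"}


def call_vlm(messages: list[dict]) -> dict:
    """Ask the VLM for the next step: a tool call or a final answer (stub)."""
    raise NotImplementedError


def run_tool(name: str, args: dict[str, Any]) -> Any:
    """Dispatch a tool call to its backend, e.g. a Toolshed-style tool server (stub)."""
    raise NotImplementedError


def agentic_episode(task: str, max_steps: int = 10) -> str:
    """Alternate reasoning, perception (vision tools), and action (robot tools)."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_vlm(messages)      # e.g. {"reasoning": ..., "tool": ..., "args": ...} or {"answer": ...}
        if "answer" in step:           # the model decides it has gathered enough information
            return step["answer"]
        assert step["tool"] in VISION_TOOLS | ROBOT_TOOLS
        observation = run_tool(step["tool"], step["args"])
        messages.append({"role": "assistant", "content": step["reasoning"]})
        messages.append({"role": "tool", "content": str(observation)})
    return "max steps reached without a final answer"
```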
Double Interactive Reinforcement Learning (DIRL) is a two-phase training framework designed to teach VLMs reliable and scalable multi-tool coordination. It progresses through:
1. **Teaching phase:** the VLM is trained on demonstrations that combine trajectories from a single-tool specialist (itself trained via interactive RL) with multi-tool traces collected from a frontier model.
2. **Exploration phase:** the model further refines multi-tool coordination through continued interactive RL, exploring tool-use strategies and learning from feedback.
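A minimal sketch of the two phases is given below, under the assumption that the teaching phase is a supervised pass over the combined demonstrations and the exploration phase uses GRPO-style group rollouts. The helper functions (`sft_update`, `grpo_update`, `rollout_with_tools`) are hypothetical stubs standing in for the actual training stack.

```python
# Sketch of the two DIRL phases; helper functions are placeholders, not the real trainer.

def sft_update(model, trace):
    """Supervised update on one demonstration trace (stub)."""
    raise NotImplementedError

def grpo_update(model, rollouts, rewards):
    """Group-relative policy-gradient update over sampled rollouts (stub)."""
    raise NotImplementedError

def rollout_with_tools(model, task):
    """One interactive episode with live tool calls, executed via the tool infrastructure (stub)."""
    raise NotImplementedError


def teaching_phase(model, specialist_traces, frontier_traces, epochs=1):
    """Phase 1: teach multi-tool behaviour from demonstrations.

    specialist_traces: single-tool trajectories from a specialist trained with interactive RL.
    frontier_traces:   multi-tool trajectories collected from a frontier model.
    """
    demonstrations = specialist_traces + frontier_traces
    for _ in range(epochs):
        for trace in demonstrations:
            sft_update(model, trace)
    return model


def exploration_phase(model, tasks, reward_fn, group_size=8):
    """Phase 2: refine multi-tool coordination with continued interactive RL."""
    for task in tasks:
        group = [rollout_with_tools(model, task) for _ in range(group_size)]
        rewards = [reward_fn(task, traj) for traj in group]
        grpo_update(model, group, rewards)
    return model
```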
The interactive RL, the collection of real tool-use traces from frontier models, and the inference-time tool deployment are all made possible by our scalable Toolshed infrastructure.
Left: Toolshed is an infrastructure for deploying heavy tools during training and inference; it enables efficient tool execution and management for the DIRL framework. Right: Visualization of the interactive reinforcement learning training process (via GRPO), showing how the model learns to coordinate multiple tools through exploration and feedback.
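For reference, the group-relative advantage at the core of GRPO can be computed as below. This is a generic GRPO sketch; the exact reward design and hyperparameters used for SpaceTools are not reproduced here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize each rollout's reward within its group.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled trajectory.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts with 4 tool-use rollouts each. The reward values here are made up;
# in practice they could combine answer correctness with tool-call validity.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.2, 0.8, 0.5, 0.5]])
print(grpo_advantages(rewards))
```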
Toolshed is a scalable, distributed, asynchronous framework for deploying compute-heavy vision and robotic tools alongside VLM training and inference. It mitigates bottlenecks through resource and environment isolation, decoupled tool execution, and asynchronous parallel workers that scale independently of model compute. Toolshed hosts modular vision tools (e.g., segmentation, pointing, depth estimation, 3D box fitting, grasp prediction) and robotic tools (e.g., image capture, grasp execution, placement).
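The exact Toolshed implementation is not shown here, but the decoupled, asynchronous pattern it describes can be sketched with a simple queue of tool requests served by parallel workers. The backend, tool names, and worker count below are illustrative assumptions.

```python
import asyncio

async def tool_worker(queue: asyncio.Queue, backends: dict):
    """One asynchronous worker: pulls tool requests and runs the matching backend."""
    while True:
        tool, payload, fut = await queue.get()
        try:
            fut.set_result(await backends[tool](payload))   # heavy tool runs behind its own backend
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            queue.task_done()

async def submit(queue: asyncio.Queue, tool: str, payload: dict):
    """Called from the training/inference loop; awaits one result without blocking other rollouts."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((tool, payload, fut))
    return await fut

async def main():
    # Stand-in backend: a real deployment would route "segment", "depth", "predict_grasp",
    # or robot-control calls to isolated processes / GPUs / machines.
    backends = {"depth": lambda payload: asyncio.sleep(0.1, result={"depth_m": 0.42})}
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(tool_worker(queue, backends)) for _ in range(4)]

    # Many rollouts can request tools concurrently; workers scale independently of model compute.
    results = await asyncio.gather(*(submit(queue, "depth", {"point": [412, 305]}) for _ in range(8)))
    print(results)

    for w in workers:
        w.cancel()

asyncio.run(main())
```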
@misc{chen2025spacetoolstoolaugmentedspatialreasoning,
title={SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL},
author={},
year={2025},
eprint={2512.04069},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.04069}
}