TL;DR: SpaceTools empowers VLMs with vision and robotic tools for spatial reasoning via Double Interactive Reinforcement Learning (DIRL), enabled by our Toolshed infrastructure. Achieves state-of-the-art performance on spatial reasoning benchmarks and enables precise real-world robot manipulation.
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with the metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can draw on a wide variety of tools to augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet realizing this vision remains an open challenge: existing approaches rely on handcrafted prompting strategies or fixed, predefined tool pipelines, which limit the VLM's ability to discover optimal tool-use patterns. Reinforcement Learning could close this gap, but so far it has been limited to reasoning with a single visual tool because the search space of multi-tool reasoning is large.
We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, equipped with tool-augmented spatial reasoning, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ask) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL yields substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines.
SpaceTools learns to use multiple computer vision tools to solve complex problems.
Spatial reasoning visualization of SpaceTools. It performs diverse spatial reasoning tasks including relative depth, pose, grasp, spatial compatibility, and spatial relationship by interleaving reasoning (gray) and vision tool calls (green) before producing the final answer.
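As a concrete illustration of such an interleaved trajectory, the sketch below shows what a single relative-depth query might look like as a message sequence. The tool names, arguments, and schema are illustrative assumptions, not SpaceTools' actual output format.

```python
# Hypothetical interleaved trajectory for a relative-depth question.
# Tool names ("point", "estimate_depth") and the message schema are illustrative only.
trace = [
    {"role": "user", "content": "Which is closer to the camera, the mug or the laptop?"},
    {"role": "assistant",                                                      # reasoning step (gray)
     "content": "I should localize both objects, then compare their metric depth.",
     "tool_call": {"name": "point", "args": {"queries": ["mug", "laptop"]}}},  # vision tool call (green)
    {"role": "tool", "content": {"mug": [412, 305], "laptop": [188, 240]}},
    {"role": "assistant",
     "content": "Now query depth at the two predicted points.",
     "tool_call": {"name": "estimate_depth", "args": {"points": [[412, 305], [188, 240]]}}},
    {"role": "tool", "content": {"depth_m": [0.48, 0.92]}},
    {"role": "assistant",
     "content": "The mug is at 0.48 m and the laptop at 0.92 m, so the mug is closer.",
     "answer": "mug"},                                                         # final answer
]
```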
| Model | RoboSpatial VQA | RoboSpatial Vacant | RoboSpatial Overall | BLINK Depth | RefSpatial | CVBench 2D Rel. | CVBench 3D Depth | BOP-ask Pose | BOP-ask Grasp-MACE | BOP-ask Grasp-SR |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | ||||||||||
| Claude Sonnet 4.5 | 75.44 | 23.77 | 57.43 | 78.23 | 7.49 | 89.85 | 78.50 | 1.67 | 40.12 | 48.33 |
| GPT-4o | 61.61 | 25.10 | 48.88 | 63.71 | 8.48 | 88.77 | 75.50 | 0.00 | 5.50 | 1.67 |
| GPT-5 | 76.50 | 22.17 | 58.39 | 66.13 | 23.10 | 95.54 | 91.33 | 9.03 | 39.59 | 41.67 |
| Gemini-ER 1.5 | 79.30 | 31.10 | 62.50 | 69.23 | 41.72 | 95.54 | 90.50 | 0.00 | 30.06 | 23.33 |
| General Open-Source Models | ||||||||||
| LLaVA-NeXT-8B | 69.31 | 0.00 | 45.15 | 53.23 | 0.78 | 72.15 | 73.67 | 0.00 | 5.04 | 1.67 |
| Qwen2.5-VL-32B | 61.84 | 3.28 | 41.43 | 70.16 | 7.28 | 90.46 | 86.67 | 0.00 | 29.86 | 23.33 |
| Qwen2.5-VL-3B | 53.07 | 0.00 | 35.71 | 70.98 | 0.00 | 70.62 | 65.33 | 0.00 | 6.06 | 0.00 |
| Spatial VLMs | ||||||||||
| SpaceLLaVA-13B | 61.00 | 2.50 | 40.61 | 51.61 | 3.25 | 61.08 | 62.83 | 0.00 | 0.00 | 0.00 |
| RoboPoint-13B | 70.18 | 19.70 | 52.58 | 54.84 | 15.59 | 74.00 | 76.50 | 0.00 | 0.00 | 0.00 |
| Molmo-7B | 39.92 | 0.82 | 26.29 | 54.03 | 0.00 | 72.15 | 73.33 | 0.00 | 36.74 | 18.33 |
| RoboBrain2.0-7B | 59.64 | 44.35 | 54.31 | 84.68 | 32.50 | 87.23 | 90.00 | 0.00 | 0.00 | 0.00 |
| RoboRefer-8B-SFT | 58.33 | 61.48 | 59.43 | 88.71 | 48.37 | 96.31 | 96.50 | 0.00 | 0.00 | 0.00 |
| Tool-free Fine-tuning | ||||||||||
| Qwen2.5-VL-3B-Tool-free SFT | 66.66 | 41.80 | 58.00 | 80.65 | 20.22 | 91.54 | 83.33 | 2.44 | 39.47 | 35.00 |
| Qwen2.5-VL-3B-Tool-free RL | 67.54 | 28.69 | 54.00 | 80.65 | 23.10 | 87.38 | 70.83 | 12.00 | 38.79 | 36.67 |
| SpaceTools-3B (Ours) | 79.38 | 52.46 | 70.00 | 90.32 | 53.07 | 94.92 | 96.00 | 34.37 | 43.06 | 50.00 |
Performance comparison across spatial reasoning benchmarks. All values are normalized accuracy (%). Bold indicates the best performance within each column, and underline denotes the second-best result.
Demo 1
Demo 2
Demo 3
SpaceTools demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. The model completes multi-step tasks by alternating between reasoning (gray), vision tools (green) for perception, and robot tools (blue) for action.
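To make this alternation concrete, here is a minimal sketch of such an agentic loop. The helper functions (`call_vlm`, `run_tool`) and the specific tool names are assumptions for illustration, not the actual SpaceTools or Toolshed API.

```python
# Minimal sketch of the reasoning / vision-tool / robot-tool loop.
# All names below are illustrative; the real tool interface is provided by Toolshed.
from typing import Any

VISION_TOOLS = {"segment", "point", "estimate_depth", "fit_3d_box", "predict_grasp"}
ROBOT_TOOLS = {"capture_image", "execute_grasp", "place"}


def call_vlm(messages: list[dict]) -> dict:
    """Ask the VLM for the next step: a tool call or a final answer (stub)."""
    raise NotImplementedError


def run_tool(name: str, args: dict[str, Any]) -> Any:
    """Dispatch a tool call to its backend, e.g. a Toolshed-style tool server (stub)."""
    raise NotImplementedError


def agentic_episode(task: str, max_steps: int = 10) -> str:
    """Alternate reasoning, perception (vision tools), and action (robot tools)."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_vlm(messages)      # e.g. {"reasoning": ..., "tool": ..., "args": ...} or {"answer": ...}
        if "answer" in step:           # the model decides it has gathered enough information
            return step["answer"]
        assert step["tool"] in VISION_TOOLS | ROBOT_TOOLS
        observation = run_tool(step["tool"], step["args"])
        messages.append({"role": "assistant", "content": step["reasoning"]})
        messages.append({"role": "tool", "content": str(observation)})
    return "max steps reached without a final answer"
```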
Double Interactive Reinforcement Learning (DIRL) is a two-phase training framework designed to teach VLMs reliable and scalable multi-tool coordination. It progresses through:
1. **Teaching phase:** the VLM is trained on demonstrations that combine trajectories from a single-tool specialist (itself trained via interactive RL) with multi-tool traces collected from a frontier model.
2. **Exploration phase:** the model further refines multi-tool coordination through continued interactive RL, exploring tool-use strategies and learning from feedback.
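A minimal sketch of the two phases is given below, under the assumption that the teaching phase is a supervised pass over the combined demonstrations and the exploration phase uses GRPO-style group rollouts. The helper functions (`sft_update`, `grpo_update`, `rollout_with_tools`) are hypothetical stubs standing in for the actual training stack.

```python
# Sketch of the two DIRL phases; helper functions are placeholders, not the real trainer.

def sft_update(model, trace):
    """Supervised update on one demonstration trace (stub)."""
    raise NotImplementedError

def grpo_update(model, rollouts, rewards):
    """Group-relative policy-gradient update over sampled rollouts (stub)."""
    raise NotImplementedError

def rollout_with_tools(model, task):
    """One interactive episode with live tool calls, executed via the tool infrastructure (stub)."""
    raise NotImplementedError


def teaching_phase(model, specialist_traces, frontier_traces, epochs=1):
    """Phase 1: teach multi-tool behaviour from demonstrations.

    specialist_traces: single-tool trajectories from a specialist trained with interactive RL.
    frontier_traces:   multi-tool trajectories collected from a frontier model.
    """
    demonstrations = specialist_traces + frontier_traces
    for _ in range(epochs):
        for trace in demonstrations:
            sft_update(model, trace)
    return model


def exploration_phase(model, tasks, reward_fn, group_size=8):
    """Phase 2: refine multi-tool coordination with continued interactive RL."""
    for task in tasks:
        group = [rollout_with_tools(model, task) for _ in range(group_size)]
        rewards = [reward_fn(task, traj) for traj in group]
        grpo_update(model, group, rewards)
    return model
```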
The interactive RL, the collection of real tool-use traces from frontier models, and the inference-time tool deployment are all made possible by our scalable Toolshed infrastructure.
Left: Toolshed is an infrastructure for deploying heavy tools during training and inference; it enables efficient tool execution and management for the DIRL framework. Right: Visualization of the interactive reinforcement learning training process (via GRPO), showing how the model learns to coordinate multiple tools through exploration and feedback.
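For reference, the group-relative advantage at the core of GRPO can be computed as below. This is a generic GRPO sketch; the exact reward design and hyperparameters used for SpaceTools are not reproduced here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize each rollout's reward within its group.

    rewards: shape (num_prompts, group_size), one scalar reward per sampled trajectory.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts with 4 tool-use rollouts each. The reward values here are made up;
# in practice they could combine answer correctness with tool-call validity.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.2, 0.8, 0.5, 0.5]])
print(grpo_advantages(rewards))
```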
Toolshed is a scalable, distributed, asynchronous framework for deploying compute-heavy vision and robotic tools alongside VLM training and inference. It mitigates bottlenecks through resource and environment isolation, decoupled tool execution, and asynchronous parallel workers that scale independently of model compute. Toolshed hosts modular vision tools (e.g., segmentation, pointing, depth estimation, 3D box fitting, grasp prediction) and robotic tools (e.g., image capture, grasp execution, placement).
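The exact Toolshed implementation is not shown here, but the decoupled, asynchronous pattern it describes can be sketched with a simple queue of tool requests served by parallel workers. The backend, tool names, and worker count below are illustrative assumptions.

```python
import asyncio

async def tool_worker(queue: asyncio.Queue, backends: dict):
    """One asynchronous worker: pulls tool requests and runs the matching backend."""
    while True:
        tool, payload, fut = await queue.get()
        try:
            fut.set_result(await backends[tool](payload))   # heavy tool runs behind its own backend
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            queue.task_done()

async def submit(queue: asyncio.Queue, tool: str, payload: dict):
    """Called from the training/inference loop; awaits one result without blocking other rollouts."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((tool, payload, fut))
    return await fut

async def main():
    # Stand-in backend: a real deployment would route "segment", "depth", "predict_grasp",
    # or robot-control calls to isolated processes / GPUs / machines.
    backends = {"depth": lambda payload: asyncio.sleep(0.1, result={"depth_m": 0.42})}
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(tool_worker(queue, backends)) for _ in range(4)]

    # Many rollouts can request tools concurrently; workers scale independently of model compute.
    results = await asyncio.gather(*(submit(queue, "depth", {"point": [412, 305]}) for _ in range(8)))
    print(results)

    for w in workers:
        w.cancel()

asyncio.run(main())
```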
@misc{chen2025spacetoolstoolaugmentedspatialreasoning,
title={SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL},
author={},
year={2025},
eprint={2512.04069},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.04069}
}