TinyGiantVLM

A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

Vinh-Thuan Ly    Hoang M. Truong    Xuan-Huong Nguyen


  • University of Science, VNU-HCM, Ho Chi Minh City, Vietnam

Abstract

Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings.

In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning in complex logistics scenes, as distinct from traditional geographic reasoning. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training follows a two-phase strategy: the first phase generates free-form answers to strengthen spatial reasoning ability, while the second phase uses normalized answers for evaluation.

Evaluated on Track 3 of the AI City Challenge 2025, our 64M-parameter base model achieved 5th place on the leaderboard with a score of 66.8861, demonstrating strong performance in bridging visual perception and spatial understanding in industrial environments. We also present an 80M-parameter variant with expanded MoE capacity, which further improves performance on spatial reasoning tasks.

Problem Statement

An example of a multiple-choice question (MCQ) from the dataset.

Existing VLMs struggle with fine-grained spatial reasoning in real-world industrial settings. The Warehouse Spatial Intelligence track addresses this gap with the PhysicalAI-Spatial-Intelligence-Warehouse dataset, which targets warehouse-scale 3D scene understanding through natural language questions. The challenge encompasses four distinct types of spatial reasoning tasks: distance estimation, object counting, multiple-choice questions (MCQ) for spatial grounding, and left/right spatial relation queries.

The figure above shows an example question-answer pair for the multiple-choice question (MCQ) type.

Method

TinyGiantVLM Architecture.

Feature Extraction

Our method begins with a compact yet powerful feature extraction pipeline. By leveraging pretrained backbones on both RGB and depth inputs, we jointly encode global scene context and localized object features. This dual-stream representation enables the model to efficiently capture the multimodal spatial structure critical for fine-grained reasoning, while maintaining a small parameter footprint suitable for resource-constrained deployment.
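
As a concrete illustration, below is a minimal PyTorch sketch of this dual-stream design. The ResNet-18 encoders, the shared backbone for region crops, the 224x224 crop size, and the projection dimension are illustrative assumptions, not necessarily the exact configuration used in TinyGiantVLM.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class DualStreamEncoder(nn.Module):
    """Global + region-level features from RGB and depth (illustrative sketch)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Lightweight encoders; pretrained weights would be loaded in practice.
        self.rgb_backbone = resnet18(weights=None)
        self.rgb_backbone.fc = nn.Identity()       # 512-d global RGB feature
        self.depth_backbone = resnet18(weights=None)
        self.depth_backbone.fc = nn.Identity()     # 512-d global depth feature
        self.proj = nn.Linear(512 * 2, embed_dim)  # fuse the two global streams

    def encode_regions(self, rgb, boxes):
        # Encode cropped object regions with the RGB backbone (boxes shared across the batch).
        feats = []
        for (x1, y1, x2, y2) in boxes:
            crop = rgb[:, :, y1:y2, x1:x2]
            crop = F.interpolate(crop, size=(224, 224), mode="bilinear", align_corners=False)
            feats.append(self.rgb_backbone(crop))
        return torch.stack(feats, dim=1)            # (B, num_regions, 512)

    def forward(self, rgb, depth, boxes):
        depth3 = depth.repeat(1, 3, 1, 1)           # replicate 1-channel depth to 3 channels
        global_feat = torch.cat([self.rgb_backbone(rgb),
                                 self.depth_backbone(depth3)], dim=-1)
        global_feat = self.proj(global_feat)        # (B, embed_dim) global scene embedding
        region_feat = self.encode_regions(rgb, boxes)
        return global_feat, region_feat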

Expert-Driven Multimodal Fusion via MoE

At the heart of TinyGiantVLM lies a Mixture-of-Experts (MoE) fusion module, specifically designed for spatial reasoning tasks. Instead of treating all spatial questions equally, our model learns to dynamically route inputs through specialized expert branches. Each expert specializes in a different reasoning pattern, and a learned gating mechanism dynamically selects the most relevant experts conditioned on the input.
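
A minimal sketch of such a fusion module is shown below, assuming soft gating over a small set of feed-forward experts; the number of experts, the routing strategy (soft vs. top-k), and the hidden width are assumptions for illustration.

import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Gated mixture-of-experts fusion over spatial features (illustrative sketch)."""
    def __init__(self, dim=256, num_experts=4, hidden=512):
        super().__init__()
        # Each expert is a small feed-forward branch specializing in one reasoning pattern.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)     # learned gating over experts

    def forward(self, x):
        # x: (B, dim) spatial representation conditioned on the question
        weights = torch.softmax(self.gate(x), dim=-1)                # (B, E) routing weights
        expert_out = torch.stack([e(x) for e in self.experts], 1)    # (B, E, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)       # (B, dim) fused output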

Re-Inspection and Answer Generation

To further refine understanding of complex layouts, we introduce a re-inspection step that re-attends to key spatial relations before answer generation. This is integrated into a two-phase training paradigm: an initial phase encourages free-form reasoning to enhance flexibility, followed by a structured normalization phase optimized for benchmark evaluation.
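
The two-phase schedule can be sketched as follows; the model.loss(...) interface, the batch field names, and the epoch counts are hypothetical, and the re-inspection step is assumed to run inside the model's forward pass rather than being shown explicitly.

def train_two_phase(model, free_form_loader, normalized_loader,
                    optimizer, epochs_phase1=3, epochs_phase2=3):
    # Phase 1: supervise with free-form answers to encourage flexible spatial reasoning.
    for _ in range(epochs_phase1):
        for batch in free_form_loader:
            loss = model.loss(batch["inputs"], batch["free_form_answer"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Phase 2: fine-tune on normalized answers matching the benchmark's expected format.
    for _ in range(epochs_phase2):
        for batch in normalized_loader:
            loss = model.loss(batch["inputs"], batch["normalized_answer"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()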

Results and Ablation Study

We conduct ablation experiments to assess the impact of the MoE fusion module and the two-phase training strategy. Phase 1 corresponds to free-form answers; Phase 2 corresponds to normalized answers. Scores are reported on the validation set.

MoE    Phase 1    Phase 2    Score (%)
                             25.59
                             63.65
                             65.09
                             68.13

Challenge Leaderboard

TinyGiantVLM ranks 5th on the public leaderboard of Track 3 in the AI City Challenge 2025, evaluated on 3D spatial reasoning tasks under resource constraints.

Rank   Team Name        Score (%)
1      UWIPL_ETRI       95.8638
2      HCMUT.VNU        91.9735
3      Embia            90.6772
4      MIZSU            73.0606
5      HCMUS_HTH        66.8861
6      MealsRetrieval   53.4763
7      BKU22            50.3662
8      Smart Lab        31.9245
9      AICV             28.2993

BibTeX

@misc{ly2025tinygiantvlmlightweightvisionlanguagearchitecture,
      title={TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints}, 
      author={Vinh-Thuan Ly and Hoang M. Truong and Xuan-Huong Nguyen},
      year={2025},
      eprint={2508.17595},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.17595}, 
}