TinyGiantVLM

A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

Vinh-Thuan Ly    Hoang M. Truong    Xuan-Huong Nguyen


  • University of Science, VNU-HCM, Ho Chi Minh City, Vietnam

Abstract

Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings.

In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning in complex logistics scenes, as distinct from traditional geographic reasoning. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training follows a two-phase strategy: the first phase generates free-form answers to strengthen spatial reasoning ability, while the second phase uses normalized answers for evaluation.

Evaluated on Track 3 of the AI City Challenge 2025, our 64M-parameter base model achieved 5th place on the leaderboard with a score of 66.8861, demonstrating strong performance in bridging visual perception and spatial understanding in industrial environments. We also present an 80M-parameter variant with expanded MoE capacity, which further improves performance on spatial reasoning tasks.

Problem Statement

An example of a multiple-choice question (MCQ) from the dataset.

Existing VLMs struggle with fine-grained spatial reasoning in real-world industrial settings. The Warehouse Spatial Intelligence track addresses this gap with the PhysicalAI-Spatial-Intelligence-Warehouse dataset, which targets warehouse-scale 3D scene understanding through natural language questions. The challenge encompasses four distinct types of spatial reasoning tasks: distance estimation, object counting, multiple-choice questions (MCQ) for spatial grounding, and left/right spatial relation queries.

The figure above shows an example question-answer pair for the multiple-choice question (MCQ) type.

Method

TinyGiantVLM Architecture.

Feature Extraction

Our method begins with a compact yet powerful feature extraction pipeline. By leveraging pretrained backbones on both RGB and depth inputs, we jointly encode global scene context and localized object features. This dual-stream representation enables the model to efficiently capture the multimodal spatial structure critical for fine-grained reasoning, while maintaining a small parameter footprint suitable for resource-constrained deployment.
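
As a concrete illustration, below is a minimal PyTorch sketch of this dual-stream design. The ResNet-18 encoders, the shared backbone for region crops, the 224x224 crop size, and the projection dimension are illustrative assumptions, not necessarily the exact configuration used in TinyGiantVLM.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class DualStreamEncoder(nn.Module):
    """Global + region-level features from RGB and depth (illustrative sketch)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Lightweight encoders; pretrained weights would be loaded in practice.
        self.rgb_backbone = resnet18(weights=None)
        self.rgb_backbone.fc = nn.Identity()       # 512-d global RGB feature
        self.depth_backbone = resnet18(weights=None)
        self.depth_backbone.fc = nn.Identity()     # 512-d global depth feature
        self.proj = nn.Linear(512 * 2, embed_dim)  # fuse the two global streams

    def encode_regions(self, rgb, boxes):
        # Encode cropped object regions with the RGB backbone (boxes shared across the batch).
        feats = []
        for (x1, y1, x2, y2) in boxes:
            crop = rgb[:, :, y1:y2, x1:x2]
            crop = F.interpolate(crop, size=(224, 224), mode="bilinear", align_corners=False)
            feats.append(self.rgb_backbone(crop))
        return torch.stack(feats, dim=1)            # (B, num_regions, 512)

    def forward(self, rgb, depth, boxes):
        depth3 = depth.repeat(1, 3, 1, 1)           # replicate 1-channel depth to 3 channels
        global_feat = torch.cat([self.rgb_backbone(rgb),
                                 self.depth_backbone(depth3)], dim=-1)
        global_feat = self.proj(global_feat)        # (B, embed_dim) global scene embedding
        region_feat = self.encode_regions(rgb, boxes)
        return global_feat, region_feat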

Expert-Driven Multimodal Fusion via MoE

At the heart of TinyGiantVLM lies a Mixture-of-Experts (MoE) fusion module, specifically designed for spatial reasoning tasks. Instead of treating all spatial questions equally, our model learns to dynamically route inputs through specialized expert branches. Each expert specializes in a different reasoning pattern, and a learned gating mechanism dynamically selects the most relevant experts conditioned on the input.
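
A minimal sketch of such a fusion module is shown below, assuming soft gating over a small set of feed-forward experts; the number of experts, the routing strategy (soft vs. top-k), and the hidden width are assumptions for illustration.

import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Gated mixture-of-experts fusion over spatial features (illustrative sketch)."""
    def __init__(self, dim=256, num_experts=4, hidden=512):
        super().__init__()
        # Each expert is a small feed-forward branch specializing in one reasoning pattern.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)     # learned gating over experts

    def forward(self, x):
        # x: (B, dim) spatial representation conditioned on the question
        weights = torch.softmax(self.gate(x), dim=-1)                # (B, E) routing weights
        expert_out = torch.stack([e(x) for e in self.experts], 1)    # (B, E, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)       # (B, dim) fused output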

Re-Inspection and Answer Generation

To further refine understanding of complex layouts, we introduce a re-inspection step that re-attends to key spatial relations before answer generation. This is integrated into a two-phase training paradigm: an initial phase encourages free-form reasoning to enhance flexibility, followed by a structured normalization phase optimized for benchmark evaluation.
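
The two-phase schedule can be sketched as follows; the model.loss(...) interface, the batch field names, and the epoch counts are hypothetical, and the re-inspection step is assumed to run inside the model's forward pass rather than being shown explicitly.

def train_two_phase(model, free_form_loader, normalized_loader,
                    optimizer, epochs_phase1=3, epochs_phase2=3):
    # Phase 1: supervise with free-form answers to encourage flexible spatial reasoning.
    for _ in range(epochs_phase1):
        for batch in free_form_loader:
            loss = model.loss(batch["inputs"], batch["free_form_answer"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Phase 2: fine-tune on normalized answers matching the benchmark's expected format.
    for _ in range(epochs_phase2):
        for batch in normalized_loader:
            loss = model.loss(batch["inputs"], batch["normalized_answer"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()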

Results and Ablation Study

We conduct ablation experiments to assess the impact of the MoE fusion module and the two-phase training strategy. Phase 1 corresponds to free-form answers; Phase 2 corresponds to normalized answers. Scores are reported on the validation set.

MoE    Phase 1    Phase 2    Score (%)
                             25.59
                             63.65
                             65.09
                             68.13

Challenge Leaderboard

TinyGiantVLM ranks 5th on the public leaderboard of Track 3 in the AI City Challenge 2025, evaluated on 3D spatial reasoning tasks under resource constraints.

Rank   Team Name        Score (%)
1      UWIPL_ETRI       95.8638
2      HCMUT.VNU        91.9735
3      Embia            90.6772
4      MIZSU            73.0606
5      HCMUS_HTH        66.8861
6      MealsRetrieval   53.4763
7      BKU22            50.3662
8      Smart Lab        31.9245
9      AICV             28.2993

BibTeX

@misc{ly2025tinygiantvlmlightweightvisionlanguagearchitecture,
      title={TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints}, 
      author={Vinh-Thuan Ly and Hoang M. Truong and Xuan-Huong Nguyen},
      year={2025},
      eprint={2508.17595},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.17595}, 
}