Reasoning about fine-grained spatial relationships in
warehouse-scale environments poses a significant challenge
for existing vision-language models (VLMs), which often
struggle to comprehend 3D layouts, object arrangements,
and multimodal cues in real-world industrial settings.
In this paper, we present TinyGiantVLM, a lightweight and
modular two-stage framework for physical spatial reasoning
in complex logistics scenes, as distinct from traditional
geographic reasoning. Our approach encodes
both global and region-level features from RGB and
depth modalities using pretrained visual backbones.
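A minimal sketch of one way to realize this dual-stream encoding; the ResNet-18 backbones, stride-32 feature maps, and RoI-Align region pooling below are illustrative assumptions rather than the exact configuration:

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

class DualStreamEncoder(nn.Module):
    """Global + region-level features from RGB and depth (illustrative sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        # Pretrained RGB backbone; the depth stream reuses the same
        # architecture with its single channel repeated to three (an assumption).
        rgb = torchvision.models.resnet18(weights="DEFAULT")
        self.rgb_backbone = nn.Sequential(*list(rgb.children())[:-2])
        dep = torchvision.models.resnet18(weights="DEFAULT")
        self.depth_backbone = nn.Sequential(*list(dep.children())[:-2])
        self.proj = nn.Linear(512 * 2, dim)

    def forward(self, rgb, depth, boxes):
        # rgb: (B,3,H,W), depth: (B,1,H,W), boxes: list of (K,4) pixel boxes
        f_rgb = self.rgb_backbone(rgb)                    # (B,512,h,w)
        f_dep = self.depth_backbone(depth.repeat(1, 3, 1, 1))
        fmap = torch.cat([f_rgb, f_dep], dim=1)           # (B,1024,h,w)
        g = fmap.mean(dim=(2, 3))                         # global feature
        # Region features via RoI-Align on the fused map (backbone stride 32).
        r = roi_align(fmap, boxes, output_size=1, spatial_scale=1 / 32)
        r = r.flatten(1)                                  # (sum K, 1024)
        return self.proj(g), self.proj(r)
```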
To handle these complex multimodal inputs and
diverse question types effectively, we incorporate a Mixture-of-Experts
(MoE) fusion module, which dynamically combines spatial
representations to support downstream reasoning tasks and
improve convergence.
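The fusion module can be sketched as a softly gated mixture of expert MLPs; the expert count, hidden widths, and dense (rather than sparse top-k) gating here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFusion(nn.Module):
    """Softly gated mixture-of-experts over fused spatial features (sketch)."""
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim * 2, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim * 2, num_experts)

    def forward(self, global_feat, region_feat):
        # global_feat, region_feat: (B, dim) each; region features are
        # assumed pooled per sample before fusion.
        x = torch.cat([global_feat, region_feat], dim=-1)      # (B, 2*dim)
        weights = F.softmax(self.gate(x), dim=-1)              # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
        # The gate selects a soft combination of expert outputs per input,
        # letting different experts specialize to different question types.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)
```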
Training follows a two-phase
strategy: the first phase generates free-form answers
to strengthen spatial reasoning ability, while the second
phase fine-tunes on normalized answer formats aligned with the evaluation protocol.
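As a hypothetical illustration of what phase-two answer normalization might involve (the exact scheme depends on the task's answer format), free-form numeric answers could be reduced to a canonical form:

```python
import re

def normalize_answer(text: str) -> str:
    """Reduce a free-form answer to a canonical form (hypothetical scheme)."""
    # Extract the first number, e.g. "roughly 3.50 meters" -> "3.5".
    m = re.search(r"-?\d+(?:\.\d+)?", text)
    if m:
        return f"{float(m.group()):g}"    # strip trailing zeros
    return text.strip().lower()           # fall back to lowercased text

assert normalize_answer("roughly 3.50 meters") == "3.5"
assert normalize_answer("Left side") == "left side"
```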
Evaluated on Track 3 of the AI City Challenge 2025, our 64M-
parameter base model achieved 5th place on the leaderboard
with a score of 66.8861, demonstrating strong performance
in bridging visual perception and spatial understanding
in industrial environments. We also present
an 80M-parameter variant with expanded MoE capacity
that further improves performance on spatial reasoning tasks.