Existing 3D Large Language Models (LLMs) still struggle to achieve efficient and explainable reasoning, primarily because the mechanism of human-like, scene-object grounded reasoning remains under-explored. This paper bridges this gap by presenting a novel framework.
We first introduce a Chain-of-Thought reasoning method for 3D scenes (SceneCOT), which decomposes a complex reasoning task into simpler, manageable sub-problems and builds the corresponding visual clues with multimodal expert modules. To enable this method, we construct SceneCOT-212K, the first large-scale Chain-of-Thought reasoning dataset for 3D scenes, comprising 212K high-quality data instances.
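To make the decomposition idea concrete, the following is a minimal sketch of such a step-by-step pipeline. All names here (`ReasoningStep`, `experts["task"]`, `recognize`, `locate`, `infer`) are hypothetical illustrations under our reading of the framework, not the actual SceneCOT implementation.

```python
# Hypothetical sketch of CoT-style decomposition for 3D scene reasoning.
# Module names and signatures are illustrative, not the released code.
from dataclasses import dataclass, field


@dataclass
class ReasoningStep:
    name: str            # e.g. "task recognition", "object grounding"
    rationale: str       # natural-language explanation of this step
    clue: object = None  # visual clue produced by an expert module


@dataclass
class ReasoningChain:
    question: str
    steps: list[ReasoningStep] = field(default_factory=list)
    answer: str = ""

    def explain(self) -> str:
        """Render the chain as an interpretable trace."""
        lines = [f"Q: {self.question}"]
        lines += [f"  [{s.name}] {s.rationale}" for s in self.steps]
        lines.append(f"A: {self.answer}")
        return "\n".join(lines)


def reason(question: str, scene, experts) -> ReasoningChain:
    """Decompose a complex 3D question into simpler sub-problems,
    calling one expert module per step to build a visual clue."""
    chain = ReasoningChain(question)
    # Step 1: high-level task recognition (e.g. counting, spatial relation).
    task = experts["task"].recognize(question)
    chain.steps.append(ReasoningStep("task recognition", f"task = {task}"))
    # Step 2: ground the referenced objects in the 3D scene.
    objects = experts["grounding"].locate(question, scene)
    chain.steps.append(
        ReasoningStep("object grounding", f"{len(objects)} candidates", objects)
    )
    # Step 3: fuse the accumulated visual clues into a final answer.
    chain.answer = experts["answer"].infer(task, objects, question)
    chain.steps.append(ReasoningStep("answering", "fuse clues into answer"))
    return chain
```

Because each step records its own rationale and clue, the resulting trace can be inspected directly, which is what makes the final answer explainable.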
Extensive experiments on various complex 3D scene reasoning benchmarks demonstrate that our framework achieves state-of-the-art performance with clear interpretability. To our knowledge, this is the first successful application of the CoT technique to human-like, step-by-step reasoning for 3D scene understanding, and it shows great potential for extension to a wider range of 3D scene understanding scenarios.
Reasoning chain visualization of SceneCOT. The reasoning chain mimics the human recognition process: from high-level task recognition to low-level visual semantic understanding.
SceneCOT framework. Our framework enables interpretable step-by-step reasoning in 3D scenes through a Chain-of-Thought (CoT) process.
We illustrate the predicted reasoning chains alongside the ground-truth annotations. The first example clearly shows that the reasoning chains of SceneCOT are semantically consistent with the ground-truth annotations, despite some noise in the latter. The reasoning chains are interpretable and can be used to explain the final answers.