Existing 3D Large Language Models (LLMs) still struggle to achieve efficient and explainable reasoning, primarily because the mechanism of human-like, scene-object grounded reasoning remains under-explored. This paper bridges this gap by presenting a novel framework.
We first introduce a Chain-of-Thought reasoning method for 3D scenes (SceneCOT), which decomposes a complex reasoning task into simpler, manageable sub-problems and builds the corresponding visual clues with multimodal expert modules. To enable this method, we construct SceneCOT-212K, the first large-scale Chain-of-Thought reasoning dataset for 3D scenes, comprising 212K high-quality data instances.
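To make the decomposition idea concrete, the following is a minimal sketch of such a step-by-step pipeline. All names here (`ReasoningStep`, `experts["task"]`, `recognize`, `locate`, `infer`) are hypothetical illustrations under our reading of the framework, not the actual SceneCOT implementation.

```python
# Hypothetical sketch of CoT-style decomposition for 3D scene reasoning.
# Module names and signatures are illustrative, not the released code.
from dataclasses import dataclass, field


@dataclass
class ReasoningStep:
    name: str            # e.g. "task recognition", "object grounding"
    rationale: str       # natural-language explanation of this step
    clue: object = None  # visual clue produced by an expert module


@dataclass
class ReasoningChain:
    question: str
    steps: list[ReasoningStep] = field(default_factory=list)
    answer: str = ""

    def explain(self) -> str:
        """Render the chain as an interpretable trace."""
        lines = [f"Q: {self.question}"]
        lines += [f"  [{s.name}] {s.rationale}" for s in self.steps]
        lines.append(f"A: {self.answer}")
        return "\n".join(lines)


def reason(question: str, scene, experts) -> ReasoningChain:
    """Decompose a complex 3D question into simpler sub-problems,
    calling one expert module per step to build a visual clue."""
    chain = ReasoningChain(question)
    # Step 1: high-level task recognition (e.g. counting, spatial relation).
    task = experts["task"].recognize(question)
    chain.steps.append(ReasoningStep("task recognition", f"task = {task}"))
    # Step 2: ground the referenced objects in the 3D scene.
    objects = experts["grounding"].locate(question, scene)
    chain.steps.append(
        ReasoningStep("object grounding", f"{len(objects)} candidates", objects)
    )
    # Step 3: fuse the accumulated visual clues into a final answer.
    chain.answer = experts["answer"].infer(task, objects, question)
    chain.steps.append(ReasoningStep("answering", "fuse clues into answer"))
    return chain
```

Because each step records its own rationale and clue, the resulting trace can be inspected directly, which is what makes the final answer explainable.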
Extensive experiments on various complex 3D scene reasoning benchmarks demonstrate that our framework achieves state-of-the-art performance with clear interpretability. To our knowledge, this is the first successful application of the CoT technique to human-like, step-by-step reasoning for 3D scene understanding, and it shows great potential for extension to a wider range of 3D scene understanding scenarios.
Reasoning chain visualization of SceneCOT. The reasoning chain mimics the human recognition process: from high-level task recognition to low-level visual semantic understanding.
SceneCOT framework. Our framework enables interpretable step-by-step reasoning in 3D scenes through a Chain-of-Thought (CoT) process.
We illustrate the predicted reasoning chains alongside the ground-truth annotations. The first example clearly shows that the reasoning chains of SceneCOT are semantically consistent with the ground-truth annotations, despite some noise in the latter. The reasoning chains are interpretable and can be used to explain the final answers.