SceneCOT: Eliciting Chain-of-Thought Reasoning in 3D Scenes

1Beijing Institute for General Artificial Intelligence, 2Peking University, 3Tsinghua University

Abstract

Existing research on 3D Large Language Models (LLMs) still struggles to achieve efficient and explainable reasoning, primarily because the mechanism of human-like, scene-object grounded reasoning remains under-explored. This paper bridges the gap by presenting a novel framework.

We first introduce SceneCOT, a Chain-of-Thought reasoning method for 3D scenes that decomposes a complex reasoning task into simpler, manageable sub-problems and builds the corresponding visual clues with multimodal expert modules. To enable this method, we build SceneCOT-212K, the first large-scale 3D scene Chain-of-Thought reasoning dataset, comprising 212K high-quality data instances.

Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our framework achieves state-of-the-art performance with clear interpretability. To our knowledge, this is the first successful application of the CoT technique to human-like, step-by-step reasoning for 3D scene understanding, and it shows strong potential for extension to a wider range of 3D scene understanding scenarios.

Examples of Reasoning Chain

Reasoning chain visualization of SceneCOT. The reasoning chain mimics humans' recognition process: from high-level task recognition to low-level visual semantic understanding.
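
As a rough illustration of the coarse-to-fine chain described above, the following minimal sketch shows one way such a chain could be represented as an ordered list of stages ending in an answer. It is not the SceneCOT implementation: the class names, the stage labels (`task_recognition`, `object_grounding`, `visual_clue`), and the example question are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningStep:
    """One link in the chain: a stage label and the text produced at that stage."""
    stage: str    # hypothetical stage label, not taken from SceneCOT
    content: str  # what is stated at this stage


@dataclass
class ReasoningChain:
    """An ordered chain of steps, from high-level to low-level, ending in an answer."""
    question: str
    steps: List[ReasoningStep] = field(default_factory=list)
    answer: str = ""

    def add(self, stage: str, content: str) -> "ReasoningChain":
        """Append a step and return the chain so calls can be chained."""
        self.steps.append(ReasoningStep(stage, content))
        return self

    def render(self) -> str:
        """Format the chain as numbered lines followed by the final answer."""
        lines = [f"Q: {self.question}"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"{i}. [{step.stage}] {step.content}")
        lines.append(f"A: {self.answer}")
        return "\n".join(lines)


# Hypothetical example: the question and step contents are made up for illustration.
chain = ReasoningChain(question="How many chairs are at the table?")
chain.add("task_recognition", "This is a counting question.")
chain.add("object_grounding", "Locate the table and the chairs around it.")
chain.add("visual_clue", "Four chair instances are grounded near the table.")
chain.answer = "four"
print(chain.render())
```

Keeping each stage as an explicit, labeled step is what makes a chain like this inspectable: every intermediate statement can be compared against an annotation independently of the final answer.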

SceneCOT framework

Detailed overview of benchmarking tasks

SceneCOT framework. Our framework enables interpretable step-by-step reasoning in 3D scenes through a Chain-of-Thought (CoT) process.

Qualitative Results

We compare the predicted reasoning chains with the ground-truth annotations. In the first examples, the reasoning chains produced by SceneCOT are semantically consistent with the ground-truth annotations, even when they contain some noise. The reasoning chains are interpretable and can be used to explain the final answers.