Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Authors: Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang

Category: cs.CV

Published: 2025-06-04

Note: homepage: https://rexthinker.github.io/

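💡 One-sentence takeaway

Proposes Rex-Thinker to address the interpretability and reliability problems of object referring.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: object referring, chain-of-thought reasoning, interpretability, computer vision, deep learning, reinforcement learning, dataset construction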

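📋 Key points

Existing object referring methods usually treat the task as direct bounding box prediction, which limits interpretability and makes it hard to reject expressions that match no object.
Rex-Thinker treats object referring as a chain-of-thought reasoning task, reasoning step by step over each candidate object to assess whether it matches the given expression.
Experiments show that Rex-Thinker outperforms standard baselines in precision and interpretability, with an improved ability to reject hallucinated outputs and strong out-of-domain generalization.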

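📝 Abstract (translated)

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; 2) Trustworthy, by abstaining when no object in the image satisfies the given expression. Most methods treat referring as a direct bounding box prediction task, which limits interpretability and makes it hard to reject expressions with no matching object. We therefore propose Rex-Thinker, a model that formulates object referring as an explicit chain-of-thought reasoning task. To support this paradigm, we construct a large-scale dataset named HumanRef-CoT. Experiments show that our approach outperforms standard baselines in both precision and interpretability.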

🔬 Method details

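Problem definition: The paper targets the interpretability and reliability problems of object referring. Existing methods tend to reduce the task to bounding box prediction, so the model cannot provide an interpretable reasoning process and cannot reliably abstain when no object in the image matches the expression.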

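Core idea: Rex-Thinker recasts object referring as a chain-of-thought reasoning task. It assesses, candidate by candidate, whether each object matches the given expression, which yields an interpretable reasoning process. The design aims to make the model more transparent and trustworthy.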
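The candidate-by-candidate assessment can be illustrated with a minimal sketch. It is not the authors' implementation: in Rex-Thinker the reasoning is produced by the model itself, whereas here a caller-supplied `judge` function stands in for it. The names `refer`, `CandidateVerdict`, and `toy_judge`, and the 100-pixel image width, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class CandidateVerdict:
    box: Box
    reasoning: str  # justification recorded for this candidate
    matches: bool   # whether the candidate satisfies the expression


def refer(
    candidates: List[Box],
    expression: str,
    judge: Callable[[Box, str], Tuple[str, bool]],
) -> Tuple[List[Box], List[CandidateVerdict]]:
    """Check every candidate against the expression and keep the matches.

    An empty list of kept boxes means the system abstains.
    """
    verdicts = []
    for box in candidates:
        reasoning, matches = judge(box, expression)
        verdicts.append(CandidateVerdict(box, reasoning, matches))
    kept = [v.box for v in verdicts if v.matches]
    return kept, verdicts


def toy_judge(box: Box, expression: str) -> Tuple[str, bool]:
    # Hypothetical stand-in for model reasoning: only understands "left",
    # defined as a box centre in the left half of a 100-px-wide image.
    cx = (box[0] + box[2]) / 2
    side = "left" if cx < 50 else "right"
    return f"centre x={cx:.0f}, {side} half", ("left" in expression and cx < 50)


boxes = [(0, 0, 20, 40), (70, 0, 90, 40)]
kept, trace = refer(boxes, "person on the left", toy_judge)
none_kept, _ = refer(boxes, "person riding a horse", toy_judge)
```

Because every candidate gets its own verdict and justification, a prediction can be traced back to a specific box, and an expression that matches nothing naturally produces an empty result instead of a forced box.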

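Technical framework: Inference in Rex-Thinker has two main steps. First, it identifies all candidate object instances that belong to the referred object category. Second, it reasons step by step over each candidate before making a final prediction. Training also has two stages: a cold-start supervised fine-tuning phase that teaches the model to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization.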
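For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization. The reward values are made up, the paper's actual reward design is not described in this summary, and the use of the population standard deviation and the `eps` term are assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    # eps avoids division by zero when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one referring expression
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every response earns the same reward carries no signal
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged.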

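Key innovation: Rex-Thinker formulates object referring explicitly as a chain-of-thought reasoning task, which differs fundamentally from conventional direct bounding box prediction. This lets the model produce an interpretable reasoning chain and reject expressions that match no object.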

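Key design: Each reasoning trace follows a structured format with three parts: planning, action, and summarization. This format lets the model learn decomposed, interpretable reasoning over object candidates. The loss function and network architecture are described only as being designed to support step-by-step reasoning and accuracy.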
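One way to picture the three-part format is as a small record with a planning step, a list of per-candidate actions, and a summary. This is a sketch only: the class name `ReasoningTrace`, the field names, and the rendered labels are hypothetical and do not reproduce the actual HumanRef-CoT annotation format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningTrace:
    planning: str                                     # how the task will be decomposed
    actions: List[str] = field(default_factory=list)  # one check per candidate
    summary: str = ""                                 # final conclusion

    def render(self) -> str:
        """Serialize the trace as plain text, one step per line."""
        lines = ["Planning: " + self.planning]
        lines += [f"Action {i}: {a}" for i, a in enumerate(self.actions, start=1)]
        lines.append("Summary: " + self.summary)
        return "\n".join(lines)


trace = ReasoningTrace(
    planning="Find all people, then check which one wears a red hat.",
    actions=["Person 1: blue cap, no match.", "Person 2: red hat, match."],
    summary="Person 2 matches the expression.",
)
text = trace.render()
```

Keeping the per-candidate checks as separate entries is what makes the trace decomposed: each claim in the summary can be checked against one action line.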

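📊 Experimental highlights

Experiments show that Rex-Thinker outperforms standard baselines in both precision and interpretability on in-domain evaluation; this summary does not give the specific figures. The model also shows an improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.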
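To make the notions of precision and rejection concrete, the sketch below computes box-level precision with greedy one-to-one IoU matching. The benchmark's actual evaluation protocol is not given in this summary, so the 0.5 threshold, the matching rule, and the convention of scoring a correct abstention as 1.0 are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """Fraction of predicted boxes matched to a distinct ground-truth box."""
    if not preds:
        # Assumed convention: abstaining is correct only if nothing matches
        return 1.0 if not gts else 0.0
    unmatched = list(gts)
    true_positives = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            true_positives += 1
            unmatched.remove(best)  # each ground-truth box is used once
    return true_positives / len(preds)


one_extra = precision([(0, 0, 10, 10), (50, 50, 60, 60)], [(0, 0, 10, 10)])
correct_abstain = precision([], [])
hallucinated = precision([(0, 0, 10, 10)], [])
```

Under these assumptions, a model that outputs a box for an expression with no matching object scores 0.0 on that sample, which is why the ability to abstain affects measured precision.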

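🎯 Application scenarios

Rex-Thinker has broad potential for object referring tasks in computer vision, especially in settings that demand high interpretability and reliability, such as autonomous driving, intelligent surveillance, and human-computer interaction. Its design ideas and methods could also carry over to other tasks that require explainable reasoning.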

📄 Abstract (original)

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based RL learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
