Accurate and efficient zero-shot 6D pose estimation with frozen foundation models

Authors: Andrea Caraffa, Davide Boscaini, Fabio Poiesi

Category: cs.CV

Published: 2025-06-11

Comments: Technical report

💡 One-sentence takeaway

Proposes FreeZeV2, a training-free method for zero-shot 6D pose estimation of unseen objects.

🎯 Matched areas: Pillar 3: Spatial Perception & Semantics; Pillar 9: Embodied Foundation Models

Keywords: 6D pose estimation, zero-shot learning, computer vision, robotics, augmented reality, feature extraction, modular design

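📋 Key points

Existing approaches to zero-shot 6D pose estimation struggle to generalize to unseen objects and typically rely on compute-intensive, task-specific training on synthetic data.
FreeZeV2 is training-free: it leverages pre-trained foundation models to generalize strongly to unseen objects, improving both accuracy and efficiency.
On the BOP Benchmark, with the same segmentation masks, FreeZeV2 is 8x faster than FreeZe and 5% more accurate.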

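📝 Abstract (summary)

Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with wide application in robotics and augmented reality. Existing methods usually rely on training with synthetic data tailored to the task, which consumes substantial computational resources. To address this, the paper proposes FreeZeV2, a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 outperforms its predecessor FreeZe in both accuracy and efficiency; its main contributions are a sparse feature extraction strategy, a feature-aware scoring mechanism, and a modular design. Experiments show that FreeZeV2 establishes a new state of the art in 6D pose estimation of unseen objects on the seven core datasets of the BOP Benchmark.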

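🔬 Method details

Problem definition: The paper addresses zero-shot 6D pose estimation from RGBD data. Existing methods usually depend on large amounts of synthetic data for task-specific training, which is computationally expensive and still leaves generalization to unseen objects as a key challenge.

Core idea: FreeZeV2 is a training-free method that generalizes strongly to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. The design removes the dependence on task-specific training while improving both efficiency and accuracy.

Technical framework: FreeZeV2 is built around three main components: sparse feature extraction, a feature-aware scoring mechanism, and a modular design. Sparse feature extraction reduces inference-time computation without sacrificing accuracy. The feature-aware scoring mechanism improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates. The modular design supports ensembles of instance segmentation models, which increases robustness to segmentation mask errors.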
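To make the idea of feature-aware scoring more concrete, the following is a minimal sketch of how a pose hypothesis could be scored during RANSAC-style registration. It is not the scoring function of FreeZeV2, which this summary does not specify. The names `feature_aware_score` and `select_best_pose`, the cosine-similarity weighting, and the inlier threshold are all assumptions made for illustration. Hypothesis generation from minimal samples is omitted; only the selection step is shown.

```python
import math

def transform(R, t, p):
    """Apply the rigid transform (R, t) to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def feature_aware_score(R, t, correspondences, inlier_thresh=0.01):
    """Hypothetical score: each geometric inlier contributes its feature similarity.

    correspondences: list of (src_point, dst_point, src_feature, dst_feature).
    A plain inlier count would add 1 per inlier instead.
    """
    score = 0.0
    for p, q, fp, fq in correspondences:
        if math.dist(transform(R, t, p), q) < inlier_thresh:
            score += max(0.0, cosine(fp, fq))
    return score

def select_best_pose(hypotheses, correspondences, inlier_thresh=0.01):
    """Return the (R, t) hypothesis with the highest feature-aware score."""
    return max(
        hypotheses,
        key=lambda h: feature_aware_score(h[0], h[1], correspondences, inlier_thresh),
    )

# Toy data: hypothesis A explains 3 correspondences with matching features,
# hypothesis B explains 4 correspondences whose features disagree.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
hyp_a = (IDENTITY, (0.0, 0.0, 0.0))
hyp_b = (IDENTITY, (1.0, 0.0, 0.0))
pts = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
corr = [(p, p, (1.0, 0.0), (1.0, 0.0)) for p in pts[:3]]
corr += [(p, (p[0] + 1.0, p[1], p[2]), (1.0, 0.0), (0.0, 1.0)) for p in pts]

best = select_best_pose([hyp_a, hyp_b], corr)
```

In this toy case a pure inlier count would prefer hypothesis B (four inliers against three), while the feature-weighted score prefers hypothesis A because B's inliers pair up dissimilar features.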

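Key innovation: FreeZeV2's main novelty lies in being training-free while using an efficient feature extraction strategy. Compared with existing methods that require task-specific training, it substantially reduces computational demands while maintaining high accuracy.

Key design: FreeZeV2 adopts a sparse feature extraction strategy to improve computational efficiency at inference time. The feature-aware scoring mechanism takes feature information into account to make pose selection more accurate, and the modular design adds flexibility and robustness. Specific parameter settings are described in the paper; because the method is training-free, no task-specific loss function is involved.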
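One way a modular design can accommodate several instance segmentation models is to pool their masks, estimate a pose for each mask, and rank all resulting candidates by score. The sketch below illustrates that pattern only. How FreeZeV2 actually combines segmenters is not stated in this summary, and `PoseCandidate`, `estimate_with_ensemble`, and the stub segmenters and estimator are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class PoseCandidate:
    pose: Any      # placeholder for a 6D pose (e.g. rotation plus translation)
    score: float   # confidence assigned by the pose estimator
    source: str    # name of the segmenter that produced the mask

def estimate_with_ensemble(
    image: Any,
    segmenters: Dict[str, Callable[[Any], List[Any]]],
    estimate_pose: Callable[[Any, Any], Tuple[Any, float]],
) -> List[PoseCandidate]:
    """Pool masks from every segmenter and rank the resulting pose candidates."""
    candidates = []
    for name, segment in segmenters.items():
        for mask in segment(image):
            pose, score = estimate_pose(image, mask)
            candidates.append(PoseCandidate(pose, score, name))
    # Highest score first: a bad mask from one model can be outranked by a
    # better mask from another.
    candidates.sort(key=lambda c: c.score, reverse=True)
    return candidates

# Stub components standing in for real models (assumptions for illustration).
def segmenter_a(image):
    return ["truncated_mask"]

def segmenter_b(image):
    return ["clean_mask"]

def stub_estimate_pose(image, mask):
    quality = {"truncated_mask": 0.3, "clean_mask": 0.9}[mask]
    return ("pose_from_" + mask, quality)

ranked = estimate_with_ensemble(
    image=None,
    segmenters={"A": segmenter_a, "B": segmenter_b},
    estimate_pose=stub_estimate_pose,
)
```

Because the segmenters are passed in as plain callables, adding or swapping a model does not change the rest of the pipeline, at the price of one pose estimation per extra mask.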

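📊 Experimental highlights

FreeZeV2 performs strongly on the BOP Benchmark. With the same segmentation masks, it is 8x faster than FreeZe and 5% more accurate. With ensembles of segmentation models, accuracy improves by a further 8% while the method still runs 2.5x faster than FreeZe. These results demonstrate the advantage of FreeZeV2 for zero-shot 6D pose estimation.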
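The two reported speedups also imply a rough cost for the ensemble. The short calculation below assumes both figures are measured against the same FreeZe runtime, which the summary suggests but does not state explicitly; `relative_runtime` is a helper introduced only for this example.

```python
def relative_runtime(speedup_over_baseline: float) -> float:
    """Runtime as a fraction of the baseline, given a speedup factor."""
    return 1.0 / speedup_over_baseline

single_mask = relative_runtime(8.0)   # FreeZeV2 with the same masks as FreeZe
ensemble = relative_runtime(2.5)      # FreeZeV2 with ensembles of segmenters

# How much slower the ensemble variant is than the single-mask variant.
ensemble_cost_factor = ensemble / single_mask
```

Under that assumption the ensemble variant takes about 3.2 times as long as the single-mask variant (8 / 2.5), in exchange for the additional accuracy gain.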

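🎯 Application scenarios

Potential application areas include robot navigation, augmented reality, and autonomous driving. By improving the accuracy and efficiency of 6D pose estimation, FreeZeV2 can better handle the recognition and localization of unseen objects in real-world scenes, supporting the development and deployment of related technologies.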

📄 Abstract (original)

Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To answer No!, we introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation masks errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
