The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge

Authors: Haoru Wang, Kai Ye, Yangyan Li, Wenzheng Chen, Baoquan Chen

Category: cs.CV

Published: 2025-06-11

🔗 Code/Project: PROJECT_PAGE

💡 One-Sentence Takeaway

Proposes a novel view synthesis method that works from sparse, unposed images.

🎯 Matched Area: Pillar 3: Spatial Perception and Semantics (Perception & Semantics)

Keywords: novel view synthesis, sparse images, unposed images, 3D knowledge, deep learning, computer vision, data-driven

📋 Key Points

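Existing methods typically rely on strong 3D knowledge and ground-truth camera poses, which limits their use on sparse or unposed images.
The paper proposes a new NVS framework that minimizes 3D inductive bias and pose dependence, learning implicit 3D awareness directly from sparse 2D images.
Experiments show that the generated views are comparable in quality to those of methods that rely on posed inputs, supporting the approach's effectiveness with large-scale data.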

📝 Abstract (translated from Chinese)

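This paper addresses generalizable novel view synthesis (NVS): generating photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. The task is challenging because it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, both by embedding explicit 3D representations (such as NeRF or 3DGS) into the network design and by requiring ground-truth camera poses for input and target views. Through a systematic analysis of 3D knowledge, the paper finds that methods relying on less 3D knowledge improve faster as data scales, eventually matching methods that depend on 3D knowledge. Building on this trend, it proposes a new NVS framework that minimizes 3D inductive bias and pose dependence, takes full advantage of data scale, and learns implicit 3D awareness directly from sparse 2D images. Experiments show that the model's generated views are photorealistic and 3D-consistent, supporting the feasibility and effectiveness of the data-driven paradigm.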
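One way to make the scaling claim concrete is to compare how quickly a quality score grows with the logarithm of the training set size. The sketch below is only an illustration of that comparison: the function name `scaling_slope`, the choice of a log-linear fit, and every number in it are assumptions invented for this example, not figures or procedures from the paper.

```python
import math


def scaling_slope(dataset_sizes, scores):
    """Least-squares slope of score versus log10(dataset size)."""
    xs = [math.log10(n) for n in dataset_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Placeholder values invented purely for illustration; NOT results from the paper.
sizes = [1_000, 10_000, 100_000]
more_3d_knowledge = [20.0, 21.0, 22.0]  # hypothetical quality scores
less_3d_knowledge = [16.0, 19.0, 22.0]  # hypothetical quality scores

slope_more = scaling_slope(sizes, more_3d_knowledge)
slope_less = scaling_slope(sizes, less_3d_knowledge)

# A steeper slope means quality improves faster as data grows.
print(slope_less > slope_more)
```

With these made-up inputs, the method with less built-in 3D knowledge starts lower but has the steeper slope, which is the shape of the trend the abstract describes.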

🔬 Method Details

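Problem definition: The paper targets novel view generation from sparse or unposed 2D images. Existing methods depend on strong 3D knowledge and ground-truth camera poses, which limits their flexibility and applicability in practice.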

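Core idea: Minimize dependence on 3D knowledge and learn implicit 3D awareness directly from sparse 2D images, so that the model can still generate high-quality novel views without explicit 3D information.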

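Technical framework: The overall architecture comprises a data input module, a feature extraction module, and a view synthesis module. The data input module receives the sparse 2D images, the feature extraction module extracts image features with a deep network, and the view synthesis module uses those features to generate the new view.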
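The three-stage data flow described above can be sketched as a toy pipeline. This is not the authors' model: every function here (`extract_features`, `fuse_features`, `synthesize_view`, `novel_view`) is a hypothetical stand-in using simple averaging where the real system would use learned networks. The only point it illustrates is that images go in without camera poses and an image comes out.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def extract_features(image: Image) -> List[float]:
    """Stand-in feature extractor: mean intensity of each row."""
    return [sum(row) / len(row) for row in image]


def fuse_features(per_view: List[List[float]]) -> List[float]:
    """Average features across input views; no camera poses are used."""
    n = len(per_view)
    return [sum(vals) / n for vals in zip(*per_view)]


def synthesize_view(features: List[float], width: int) -> Image:
    """Stand-in decoder: broadcast each row feature across the image width."""
    return [[f] * width for f in features]


def novel_view(images: List[Image]) -> Image:
    """Input module -> feature extraction -> view synthesis."""
    width = len(images[0][0])
    per_view = [extract_features(img) for img in images]
    return synthesize_view(fuse_features(per_view), width)


# Two tiny 2x2 "views" with no pose information attached.
views = [
    [[0.0, 1.0], [1.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
result = novel_view(views)
print(result)
```

In a real system each stand-in would be replaced by a trained module; the function boundaries simply mirror the module boundaries named in the text.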

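Key innovation: A data-driven NVS framework that substantially reduces dependence on 3D knowledge. This differs fundamentally from techniques built on explicit 3D representations and emphasizes learning capacity under large-scale data.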

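Key design: For the network, a deep convolutional neural network (CNN) performs feature extraction, and a dedicated loss function is designed to optimize the quality of the generated views. Training requires no pose annotations, further reducing dependence on external information.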
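The text does not specify the loss, so the following is only a sketch under an assumption: that supervision comes from comparing a predicted view against a held-out image of the same scene with a per-pixel mean squared error, a common choice in view synthesis. The helper names `hold_out_target` and `photometric_mse` are hypothetical, and no pose labels appear anywhere in the training signal.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def hold_out_target(images: List[Image], target_index: int) -> Tuple[List[Image], Image]:
    """Split an unposed image set into input views and one held-out target view."""
    inputs = images[:target_index] + images[target_index + 1:]
    return inputs, images[target_index]


def photometric_mse(predicted: Image, target: Image) -> float:
    """Mean squared per-pixel error between a predicted and a target image."""
    total = 0.0
    count = 0
    for pred_row, tgt_row in zip(predicted, target):
        for p, t in zip(pred_row, tgt_row):
            total += (p - t) ** 2
            count += 1
    return total / count


# Three tiny 1x2 images of one scene; only pixels are available, no poses.
scene = [[[0.0, 1.0]], [[1.0, 1.0]], [[0.5, 0.5]]]
inputs, target = hold_out_target(scene, 1)

# Pretend the model predicted the first input image as the target view.
loss = photometric_mse(inputs[0], target)
print(loss)
```

The loss depends only on pixel values, which is what allows training without pose annotations under this assumption.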

📊 Experimental Highlights

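Experiments show that the proposed method matches pose-dependent baselines in generated view quality, and that its performance improves markedly as data scale increases. Across multiple datasets, it shows strong 3D consistency and visual realism, supporting its effectiveness.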

🎯 Application Scenarios

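Potential applications include virtual reality, augmented reality, and computer graphics, where the method could supply high-quality view synthesis. By reducing dependence on 3D knowledge, it is more flexible and adaptable when handling sparse data, and may help drive broader adoption of related techniques.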

📄 Abstract (Original)

We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations, such as NeRF or 3DGS, into network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses of input views, critical questions regarding the role of 3D knowledge and the necessity of circumventing its use remain under-explored. In this work, we conduct a systematic analysis on the 3D knowledge and uncover a critical trend: the performance of methods that requires less 3D knowledge accelerates more as data scales, eventually achieving performance on par with their 3D knowledge-driven counterparts, which highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by and following this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving even comparable performance with methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: https://pku-vcl-geometry.github.io/Less3Depend/ .
