cs.CV (2026-01-06)

📊 22 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 2: RL & Architecture (3) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

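# | Title | One-line summary | Tags | 🔗
1 | AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation | AnatomiX: an anatomy-aware multimodal large language model for chest X-ray interpretation | large language model, multimodal |
2 | Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs | Proposes TGIF: text-guided layer fusion that mitigates hallucination in multimodal LLMs | large language model, multimodal, visual grounding |
3 | Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA | Proposes a multi-agent LLM framework that addresses the visual abstraction and narrative reasoning challenges of cartoon VQA | large language model, multimodal |
4 | PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding | PrismVAU: a prompt-refined inference system for multimodal video anomaly understanding | large language model, multimodal |
5 | A Versatile Multimodal Agent for Multimedia Content Generation | Proposes a multimodal agent that automates complex multimedia content generation tasks and improves content creation efficiency | multimodal |
6 | UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision | UniCorn: improves the generation capability of unified multimodal models through self-generated supervision | multimodal |
7 | TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors | Proposes TA-Prompting, which uses temporal anchors to strengthen the temporal understanding of VideoLLMs in dense video captioning | large language model |
8 | Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench | PET-Bench reveals the gap in MLLMs' perception of functional imaging; proposes the AVA method to improve diagnostic accuracy | large language model, multimodal, chain-of-thought |
9 | ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration | ClearAIR: an all-in-one image restoration framework inspired by human visual perception that addresses the over-smoothing and artifacts of existing methods | large language model, multimodal |
10 | DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation | Proposes DiffAgent, which accelerates diffusion models through LLM-driven end-to-end code generation | large language model |
11 | Omni2Sound: Towards Unified Video-Text-to-Audio Generation | Omni2Sound: proposes a unified video-text-to-audio generation model that addresses multimodal alignment and task competition | multimodal |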

🔬 Pillar 3: Perception & Semantics (5 papers)

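# | Title | One-line summary | Tags | 🔗
12 | SA-ResGS: Self-Augmented Residual 3D Gaussian Splatting for Next Best View Selection | SA-ResGS: self-augmented residual 3D Gaussian splatting for next-best-view selection | 3D gaussian splatting, gaussian splatting, splatting |
13 | InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields | InfiniDepth: proposes an arbitrary-resolution, fine-grained depth estimation method based on neural implicit fields | depth estimation, metric depth |
14 | StableDPT: Temporal Stable Monocular Video Depth Estimation | StableDPT: improves the stability of monocular video depth estimation through temporal modeling | depth estimation, monocular depth |
15 | AnyDepth: Depth Estimation Made Easy | AnyDepth: a lightweight zero-shot monocular depth estimation framework that balances efficiency and generalization | depth estimation, monocular depth |
16 | CAMO: Category-Agnostic 3D Motion Transfer from Monocular 2D Videos | Proposes CAMO, which addresses category-agnostic motion transfer from monocular videos to 3D models | 3D gaussian splatting, gaussian splatting, splatting |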

🔬 Pillar 2: RL & Architecture (3 papers)

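# | Title | One-line summary | Tags | 🔗
17 | SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models | Proposes SketchThinker-R1, which improves sketch-style reasoning in large models while reducing computational cost | reinforcement learning, multimodal |
18 | Flow Matching and Diffusion Models via PointNet for Generating Fluid Fields on Irregular Geometries | Proposes PointNet-based flow matching and diffusion models for generating fluid fields on irregular geometries | flow matching |
19 | Foreground-Aware Dataset Distillation via Dynamic Patch Selection | Proposes a dataset distillation method based on dynamic foreground awareness that improves the representational capacity of small datasets | distillation |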

🔬 Pillar 4: Generative Motion (1 paper)

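# | Title | One-line summary | Tags | 🔗
20 | LTX-2: Efficient Joint Audio-Visual Foundation Model | LTX-2: an efficient joint audio-visual foundation model that generates high-quality, synchronized audio-visual content | classifier-free guidance, foundation model |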

🔬 Pillar 6: Video Extraction (1 paper)

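# | Title | One-line summary | Tags | 🔗
21 | Towards Faithful Reasoning in Comics for Small MLLMs | Proposes a comic reasoning framework that addresses the performance problems of small MLLMs on CVQA | HuMoR, large language model, multimodal |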

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
22 | On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning | Reveals the intrinsic limits of Transformer image embeddings in non-solvable spatial reasoning | OMOMO |
