cs.CV (2026-01-06)

📊 22 papers total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) Pillar 3: Perception & Semantics (5 🔗1) Pillar 2: RL & Architecture (3) Pillar 4: Generative Motion (1) Pillar 6: Video Extraction (1) Pillar 5: Interaction & Reaction (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
1	AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation	AnatomiX: an anatomy-aware multimodal large language model for chest X-ray interpretation	large language model multimodal	✅
2	Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs	Proposes TGIF: text-guided layer fusion to mitigate hallucination in multimodal LLMs	large language model multimodal visual grounding
3	Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA	Proposes a multi-agent LLM framework to address visual abstraction and narrative reasoning challenges in cartoon VQA	large language model multimodal
4	PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding	PrismVAU: a prompt-refined inference system for multimodal video anomaly understanding	large language model multimodal
5	A Versatile Multimodal Agent for Multimedia Content Generation	Proposes a multimodal agent that automates complex multimedia content generation tasks, improving content creation efficiency	multimodal
6	UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision	UniCorn: improves the generation capability of unified multimodal models through self-generated supervision	multimodal
7	TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors	Proposes TA-Prompting, which uses temporal anchors to strengthen the temporal understanding of VideoLLMs in dense video captioning	large language model
8	Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench	PET-Bench reveals a gap in MLLMs' perception of functional imaging; proposes the AVA method to improve diagnostic accuracy	large language model multimodal chain-of-thought	✅
9	ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration	ClearAIR: an all-in-one image restoration framework inspired by human visual perception that addresses the over-smoothing and artifacts of existing methods	large language model multimodal
10	DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation	Proposes DiffAgent, which accelerates diffusion models through LLM-driven end-to-end code generation	large language model
11	Omni2Sound: Towards Unified Video-Text-to-Audio Generation	Omni2Sound: proposes a unified video-text-to-audio generation model that addresses multimodal alignment and task competition	multimodal	✅

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
12	SA-ResGS: Self-Augmented Residual 3D Gaussian Splatting for Next Best View Selection	SA-ResGS: self-augmented residual 3D Gaussian splatting for next-best-view selection	3D gaussian splatting gaussian splatting splatting
13	InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields	InfiniDepth: proposes an arbitrary-resolution, fine-grained depth estimation method based on neural implicit fields	depth estimation metric depth
14	StableDPT: Temporal Stable Monocular Video Depth Estimation	StableDPT: improves the stability of monocular video depth estimation through temporal modeling	depth estimation monocular depth
15	AnyDepth: Depth Estimation Made Easy	AnyDepth: a lightweight zero-shot monocular depth estimation framework that balances efficiency and generalization	depth estimation monocular depth	✅
16	CAMO: Category-Agnostic 3D Motion Transfer from Monocular 2D Videos	Proposes CAMO to address category-agnostic motion transfer from monocular videos to 3D models	3D gaussian splatting gaussian splatting splatting

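🔬 Pillar 2: RL & Architecture (3 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
17	SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models	Proposes SketchThinker-R1, which improves sketch-style reasoning in large models while reducing computational cost	reinforcement learning multimodal
18	Flow Matching and Diffusion Models via PointNet for Generating Fluid Fields on Irregular Geometries	Proposes PointNet-based flow matching and diffusion models for generating fluid fields on irregular geometries	flow matching
19	Foreground-Aware Dataset Distillation via Dynamic Patch Selection	Proposes a dynamic foreground-aware dataset distillation method that improves the representational capacity of small datasets	distillation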

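🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Line Summary	Tags	🔗	⭐
20	LTX-2: Efficient Joint Audio-Visual Foundation Model	LTX-2: an efficient joint audio-visual foundation model that generates high-quality, synchronized audio-visual content	classifier-free guidance foundation model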

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-Line Summary	Tags	🔗	⭐
21	Towards Faithful Reasoning in Comics for Small MLLMs	Proposes a comic reasoning framework to address the performance problems of small MLLMs on CVQA	HuMoR large language model multimodal

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Line Summary	Tags	🔗	⭐
22	On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning	Reveals the intrinsic limits of Transformer image embeddings in non-solvable spatial reasoning	OMOMO
