cs.CV (2025-05-20)

📊 41 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (16 🔗4) · Pillar 9: Embodied Foundation Models (12 🔗5) · Pillar 3: Perception & Semantics (6 🔗4) · Pillar 1: Robot Control (3) · Pillar 5: Interaction & Reaction (1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL & Architecture (16 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning	Proposes UniVG-R1 to tackle complex multimodal visual grounding	reinforcement learning large language model multimodal	✅
2	UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation	Proposes UniGen to address challenges in multimodal understanding and generation	direct preference optimization large language model multimodal
3	Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning	Proposes Visionary-R1 to mitigate shortcut learning in visual reasoning	reinforcement learning large language model multimodal
4	Programmatic Video Prediction Using Large Language Models	Proposes ProgGen for video frame prediction	world model large language model
5	Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency	Proposes TemRobBench and PanoDPO to address temporal consistency problems in multimodal models	direct preference optimization multimodal
6	VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank	Proposes VisualQuality-R1 to address insufficient reasoning in image quality assessment	reinforcement learning large language model
7	DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning	Proposes DeepEyes to address the integration of vision and text in multimodal reasoning	reinforcement learning multimodal	✅
8	Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method	Proposes the OmniVQA dataset and the 360-R1 method for panoramic visual question answering	reinforcement learning embodied AI large language model
9	StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning	Proposes the StPR framework to address forgetting in video class-incremental learning	distillation spatiotemporal
10	Intra-class Patch Swap for Self-Distillation	Proposes a self-distillation method based on intra-class patch swap to simplify knowledge distillation	teacher-student distillation	✅
11	MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks	Proposes MultiMAE to address pre-training for multi-modal Earth observation tasks	masked autoencoder	✅
12	RETRO: REthinking Tactile Representation Learning with Material PriOrs	Proposes material priors to improve the accuracy of tactile representation learning	representation learning
13	Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search	Proposes a symbolic graph ranker to model information structure in session search	contrastive learning large language model
14	Scaling Vision Mamba Across Resolutions via Fractal Traversal	Proposes FractalMamba++ to address adaptation to varying visual input resolutions	Mamba
15	Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning	Proposes physics-driven local-whole elastic deformation modeling to improve point cloud representation learning	representation learning
16	Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels	Proposes Ground-V for pixel-level grounding of complex instructions	distillation instruction following

🔬 Pillar 9: Embodied Foundation Models (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	Speculative Decoding Reimagined for Multimodal Large Language Models	Proposes multimodal speculative decoding to accelerate inference in multimodal large language models	large language model multimodal	✅
18	EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language	Proposes the EmoSign dataset for understanding emotions in sign language	multimodal	✅
19	RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding	Proposes RAVENEA to address shortcomings in multimodal cultural understanding	multimodal
20	Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models	Proposes Video Compression Commander to address the efficiency of video large language models	large language model	✅
21	ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations	Proposes ViC-Bench to address the fixed IVS in existing MLLM evaluations	chain-of-thought
22	LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts	Proposes the LoVR benchmark to address multimodal challenges in long video retrieval	multimodal	✅
23	Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach	Proposes Llama-SMoP for AVSR in resource-constrained settings	large language model multimodal
24	RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection	Proposes the RADAR framework to address knowledge integration in radiology report generation	large language model multimodal
25	VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation	Proposes VideoEval-Pro to address the validity of long video understanding evaluation	multimodal
26	Unlocking the Power of SAM 2 for Few-Shot Segmentation	Proposes a pseudo prompt generator and iterative memory refinement for few-shot segmentation	foundation model
27	Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting	Proposes Dolphin to handle complex elements in document image parsing	multimodal	✅
28	AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards	Proposes AppleGrowthVision to address the lack of datasets for apple orchard monitoring	multimodal

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction	Proposes MGStream to address flickering and storage efficiency in dynamic scene reconstruction	3D gaussian splatting 3DGS gaussian splatting	✅
30	M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data	Proposes M3Depth for depth estimation in Martian environments	depth estimation stereo depth
31	Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image	Proposes the CP-GS framework for 3D scene personalization from a single image	3D gaussian splatting 3DGS gaussian splatting	✅
32	Multi-Label Stereo Matching for Transparent Scene Depth Estimation	Proposes a multi-label stereo matching method for depth estimation in transparent scenes	depth estimation scene reconstruction	✅
33	Diving into the Fusion of Monocular Priors for Generalized Stereo Matching	Proposes a binary local ordering map to address the fusion of monocular priors in stereo matching	monocular depth scene flow foundation model
34	4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision	Proposes 4D-ROLLS for 4D radar occupancy estimation	height map	✅

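🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
35	Emerging Properties in Unified Multimodal Pretraining	Proposes the BAGEL model to address challenges in multimodal understanding and generation	manipulation multimodal
36	Vid2World: Crafting Video Diffusion Models to Interactive World Models	Proposes Vid2World to address the low fidelity of existing world models	manipulation world model
37	Visual Agentic Reinforcement Fine-Tuning	Proposes visual agentic reinforcement fine-tuning to improve multimodal reasoning	manipulation multimodal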

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
38	Beyond Words: Multimodal LLM Knows When to Speak	Proposes MM-When2Speak to predict response timing in conversation	dyadic interaction large language model multimodal

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
39	EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation	Proposes EGFormer to address the efficiency of multimodal semantic segmentation	MDM multimodal

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
40	Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance	Proposes an egocentric action-aware inertial localization framework to address localization drift in 3D point clouds	egocentric multimodal

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
41	Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI	Proposes Dynadiff for decoding images from dynamic fMRI	diff-sim
