cs.CV(2025-05-20)

📊 41 papers in total | 🔗 13 with code

🎯 Navigate by interest area

Pillar 2: RL Algorithms & Architecture (RL & Architecture) (16 🔗4) · Pillar 9: Embodied Foundation Models (12 🔗5) · Pillar 3: Spatial Perception & Semantics (Perception & Semantics) (6 🔗4) · Pillar 1: Robot Control (3) · Pillar 5: Interaction & Reaction (1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction & Matching (Video Extraction) (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL Algorithms & Architecture (RL & Architecture) (16 papers)

# | Title | Key point | Tags | 🔗
1 | UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | Proposes UniVG-R1 to address complex multimodal visual grounding | reinforcement learning, large language model, multimodal |
2 | UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | Proposes UniGen to address the challenges of multimodal understanding and generation | direct preference optimization, large language model, multimodal |
3 | Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | Proposes Visionary-R1 to address shortcut learning in visual reasoning | reinforcement learning, large language model, multimodal |
4 | Programmatic Video Prediction Using Large Language Models | Proposes ProgGen to address video frame prediction | world model, large language model |
5 | Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency | Proposes TemRobBench and PanoDPO to address temporal consistency in multimodal models | direct preference optimization, multimodal |
6 | VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank | Proposes VisualQuality-R1 to address insufficient reasoning in image quality assessment | reinforcement learning, large language model |
7 | DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | Proposes DeepEyes to address the integration of vision and text in multimodal reasoning | reinforcement learning, multimodal |
8 | Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method | Proposes the OmniVQA dataset and the 360-R1 method to address panoramic visual question answering | reinforcement learning, embodied AI, large language model |
9 | StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning | Proposes the StPR framework to address forgetting in video class-incremental learning | distillation, spatiotemporal |
10 | Intra-class Patch Swap for Self-Distillation | Proposes a self-distillation method based on intra-class patch swapping to simplify knowledge distillation | teacher-student, distillation |
11 | MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks | Proposes MultiMAE to address pre-training for multimodal Earth observation tasks | masked autoencoder |
12 | RETRO: REthinking Tactile Representation Learning with Material PriOrs | Proposes material priors to improve the accuracy of tactile representation learning | representation learning |
13 | Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search | Proposes a symbolic graph ranker to address the modeling of information structure in session search | contrastive learning, large language model |
14 | Scaling Vision Mamba Across Resolutions via Fractal Traversal | Proposes FractalMamba++ to address adaptability to varying visual input resolutions | Mamba |
15 | Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning | Proposes physics-driven local-whole elastic deformation modeling to improve point cloud representation learning | representation learning |
16 | Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels | Proposes Ground-V to address pixel-level grounding of complex instructions | distillation, instruction following |

🔬 Pillar 9: Embodied Foundation Models (12 papers)

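# | Title | Key point | Tags | 🔗
17 | Speculative Decoding Reimagined for Multimodal Large Language Models | Proposes multimodal speculative decoding to accelerate inference in multimodal large language models | large language model, multimodal |
18 | EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language | Proposes the EmoSign dataset to address emotion understanding in sign language | multimodal |
19 | RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | Proposes RAVENEA to address insufficient multimodal cultural understanding | multimodal |
20 | Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models | Proposes Video Compression Commander to address the efficiency of video large language models | large language model |
21 | ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | Proposes ViC-Bench to address the fixed-IVS limitation of existing MLLM evaluations | chain-of-thought |
22 | LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Proposes the LoVR benchmark to address multimodal challenges in long video retrieval | multimodal |
23 | Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach | Proposes Llama-SMoP to address AVSR in resource-constrained settings | large language model, multimodal |
24 | RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection | Proposes the RADAR framework to address knowledge integration in radiology report generation | large language model, multimodal |
25 | VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation | Proposes VideoEval-Pro to address the validity of long video understanding evaluation | multimodal |
26 | Unlocking the Power of SAM 2 for Few-Shot Segmentation | Proposes a pseudo prompt generator and iterative memory refinement to address few-shot segmentation | foundation model |
27 | Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting | Proposes Dolphin to address complex elements in document image parsing | multimodal |
28 | AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards | Proposes AppleGrowthVision to address the shortage of apple orchard monitoring datasets | multimodal |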

🔬 Pillar 3: Spatial Perception & Semantics (Perception & Semantics) (6 papers)

# | Title | Key point | Tags | 🔗
29 | MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction | Proposes MGStream to address flickering and storage efficiency in dynamic scene reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting |
30 | M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data | Proposes M3Depth to address depth estimation in Martian environments | depth estimation, stereo depth |
31 | Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image | Proposes the CP-GS framework to address 3D scene personalization from a single image | 3D gaussian splatting, 3DGS, gaussian splatting |
32 | Multi-Label Stereo Matching for Transparent Scene Depth Estimation | Proposes a multi-label stereo matching method to address depth estimation in transparent scenes | depth estimation, scene reconstruction |
33 | Diving into the Fusion of Monocular Priors for Generalized Stereo Matching | Proposes a binary local ordering map to address the fusion of monocular priors in stereo matching | monocular depth, scene flow, foundation model |
34 | 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision | Proposes 4D-ROLLS to address 4D radar occupancy estimation | height map |

🔬 Pillar 1: Robot Control (3 papers)

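# | Title | Key point | Tags | 🔗
35 | Emerging Properties in Unified Multimodal Pretraining | Proposes the BAGEL model to address the challenges of multimodal understanding and generation | manipulation, multimodal |
36 | Vid2World: Crafting Video Diffusion Models to Interactive World Models | Proposes Vid2World to address the low fidelity of existing world models | manipulation, world model |
37 | Visual Agentic Reinforcement Fine-Tuning | Proposes a visual agentic reinforcement fine-tuning method to improve multimodal reasoning | manipulation, multimodal |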

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | Key point | Tags | 🔗
38 | Beyond Words: Multimodal LLM Knows When to Speak | Proposes MM-When2Speak to address response timing prediction in conversation | dyadic interaction, large language model, multimodal |

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | Key point | Tags | 🔗
39 | EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation | Proposes EGFormer to address the efficiency of multimodal semantic segmentation | MDM, multimodal |

🔬 Pillar 6: Video Extraction & Matching (Video Extraction) (1 paper)

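# | Title | Key point | Tags | 🔗
40 | Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance | Proposes an egocentric action-aware inertial localization framework to address localization drift in 3D point clouds | egocentric, multimodal |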

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | Key point | Tags | 🔗
41 | Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI | Proposes Dynadiff to address image decoding from dynamic fMRI | diff-sim |
