cs.CV (2025-05-26)

📊 48 papers in total | 🔗 18 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms and Architecture (RL & Architecture) (15 🔗4)
Pillar 9: Embodied Foundation Models (15 🔗3)
Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (8 🔗4)
Pillar 1: Robot Control (3 🔗2)
Pillar 4: Generative Motion (2 🔗2)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 6: Video Extraction and Matching (Video Extraction) (1 🔗1)
Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL Algorithms and Architecture (RL & Architecture) (15 papers)

# | Title | One-sentence summary | Tags | 🔗
1 | What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models | Proposes DICE to address the evaluation of image editing results | distillation, large language model, multimodal
2 | From Data to Modeling: Fully Open-vocabulary Scene Graph Generation | Proposes OvSGTR to address the open-vocabulary problem in conventional scene graph generation | distillation, open-vocabulary, open vocabulary
3 | Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | Proposes Vad-R1 to address video anomaly reasoning | reinforcement learning, large language model, multimodal
4 | MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models | Proposes MMGeoLM to address geometric understanding in large multimodal models | contrastive learning, multimodal
5 | Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval | Proposes a multimodal reasoning agent to address zero-shot composed image retrieval | contrastive learning, large language model, multimodal
6 | FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields | Proposes FruitNeRF++ to address counting of multiple fruit types | contrastive learning, neural radiance field, foundation model
7 | Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models | Improves medical image classification performance by fine-tuning natural-domain foundation models | Mamba, MAE, foundation model
8 | ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | Proposes ViTaPEs to address multimodal alignment | representation learning, multimodal
9 | ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving | Proposes ReasonPlan to address decision reasoning in closed-loop autonomous driving | imitation learning, large language model, multimodal
10 | Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | Proposes Omni-R1 to resolve the conflict between long video-audio reasoning and fine-grained pixel understanding | reinforcement learning, foundation model, multimodal
11 | FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities | Proposes FUDOKI to address the limitations of multimodal large language models | reinforcement learning, flow matching, large language model
12 | Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning | Proposes Ground-R1 to address supervision cost in visual reasoning | reinforcement learning, chain-of-thought
13 | Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling | Proposes training-free techniques to improve text-to-3D generation quality | distillation, classifier-free guidance
14 | Long-Context State-Space Video World Models | Proposes long-context state-space video world models to address long-term memory | world model, SSM
15 | VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection | Proposes the VisTA framework to address dynamic exploration in tool selection | reinforcement learning

🔬 Pillar 9: Embodied Foundation Models (15 papers)

# | Title | One-sentence summary | Tags | 🔗
16 | CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic | Proposes CPathAgent to address interpretability in pathology image analysis | foundation model, multimodal
17 | Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion | Proposes an MLLM-guided semantic correction method to address semantic inconsistency in text-to-image generation | large language model, multimodal
18 | Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM | Proposes Dynamic-I2V to address image-to-video generation in complex scenes | large language model, multimodal
19 | StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation | Proposes StyleAR to address style-aligned text-to-image generation | multimodal, instruction following
20 | MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness | Proposes MMPerspective to evaluate the perspective understanding of multimodal large language models | large language model, multimodal, chain-of-thought
21 | PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology | Proposes PathBench to address standardized evaluation of pathology foundation models | foundation model
22 | AdaTP: Attention-Debiased Token Pruning for Video Large Language Models | Proposes AdaTP to address attention bias in video large language models | large language model
23 | Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models | Proposes visual confidence-aware prompting to address calibration in vision-language models | large language model, multimodal
24 | Efficient Multi-modal Long Context Learning for Training-free Adaptation | Proposes EMLoC to address adaptation of multimodal large language models | large language model, multimodal
25 | DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data | Proposes the DIPO framework for controllable articulated object generation | chain-of-thought
26 | Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts | Proposes the Benign-to-Toxic method to address the failure of safety mechanisms | multimodal
27 | HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters | Proposes HunyuanVideo-Avatar to address multi-character audio-driven human animation | multimodal
28 | Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks | Proposes the CoP benchmark dataset to address evaluation of video-to-piano music generation | multimodal
29 | Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models | Proposes atomic visual skills to address the basic-task challenges of vision-language models | multimodal
30 | NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification | Proposes the NEXT framework to address fine-grained feature modeling in multimodal object re-identification | large language model

🔬 Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (8 papers)

# | Title | One-sentence summary | Tags | 🔗
31 | CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting | Proposes CCL-LGS to address cross-view inconsistency in 3D semantic understanding | gaussian splatting, splatting
32 | OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender | Proposes the OB3D dataset to address geometric distortion in omnidirectional 3D reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting
33 | Sparse2DGS: Sparse-View Surface Reconstruction using 2D Gaussian Splatting with Dense Point Cloud | Proposes Sparse2DGS to address 3D reconstruction from sparse views | gaussian splatting, splatting
34 | Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction | Proposes a depth-guided bundle sampling strategy to accelerate neural radiance field reconstruction | NeRF, neural radiance field
35 | Total-Editing: Head Avatar with Editable Appearance, Motion, and Lighting | Proposes the Total-Editing framework for head avatars with editable appearance, motion, and lighting | neural radiance field, geometric consistency, spatiotemporal
36 | GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis | Proposes GoLF-NRT to address quality degradation in few-shot view synthesis | NeRF, neural radiance field, scene reconstruction
37 | Weather-Magician: Reconstruction and Rendering Framework for 4D Weather Synthesis In Real Time | Proposes the Weather-Magician framework to address real-time weather synthesis | gaussian splatting, splatting
38 | ErpGS: Equirectangular Image Rendering enhanced with 3D Gaussian Regularization | Proposes ErpGS to address distortion in 360-degree image rendering | 3DGS, NeRF

🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-sentence summary | Tags | 🔗
39 | In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation | Proposes In-Context Brush to address customized subject insertion | manipulation, multimodal
40 | ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image | Proposes ControlTac to address the high cost of large-scale tactile data collection | manipulation, physically plausible
41 | Attention! Your Vision Language Model Could Be Maliciously Manipulated | Proposes a manipulation attack on vision-language models to address model vulnerability | manipulation

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-sentence summary | Tags | 🔗
42 | PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation | Proposes PAMD to address physical plausibility in long dance generation | motion diffusion model, motion diffusion, physically plausible
43 | MotionPro: A Precise Motion Controller for Image-to-Video Generation | Proposes MotionPro to address precise motion control in image-to-video generation | motion synthesis

🔬 Pillar 7: Motion Retargeting (2 papers)

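# | Title | One-sentence summary | Tags | 🔗
44 | Agentic 3D Scene Generation with Spatially Contextualized VLMs | Proposes a new paradigm to address the limitations of VLMs in 3D scene generation | spatial relationship, embodied AI, multimodal
45 | VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | Proposes VLM-3R to address the challenges of 3D scene understanding | spatial relationship, multimodal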

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-sentence summary | Tags | 🔗
46 | Electrolyzers-HSI: Close-Range Multi-Scene Hyperspectral Imaging Benchmark Dataset | Proposes the Electrolyzers-HSI dataset to accelerate electrolyzer material classification | HSI, multimodal

🔬 Pillar 6: Video Extraction and Matching (Video Extraction) (1 paper)

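# | Title | One-sentence summary | Tags | 🔗
47 | AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models | Proposes AniCrafter to address the limitations of human animation with dynamic backgrounds | SMPL, SMPL-X, character animation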

🔬 Pillar 8: Physics-based Animation (1 paper)

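# | Title | One-sentence summary | Tags | 🔗
48 | Structured Initialization for Vision Transformers | Proposes a structured initialization method to improve vision transformer performance | PULSE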
