cs.CV(2025-05-29)

📊 46 papers in total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (25 🔗10) · Pillar 2: RL & Architecture (10 🔗5) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 4: Generative Motion (1) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (25 papers)

# | Title | One-line summary | Tags | 🔗
1 | Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | Proposes Impromptu VLA to address challenges facing vision-language-action models in autonomous driving | vision-language-action, VLA
2 | Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought | Proposes Argus to address insufficient attention in visual reasoning | large language model, multimodal, chain-of-thought
3 | Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model | Proposes a preemptive hallucination-reduction method to address hallucination in multimodal language models | large language model, multimodal
4 | OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | Proposes OpenUni to unify multimodal understanding and generation | large language model, multimodal
5 | MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking | Proposes MaskAdapt to address unsupervised domain adaptation in agriculture | multimodal
6 | Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence | Proposes Spatial-MLLM to address visual-based spatial intelligence | large language model, foundation model, multimodal
7 | FMG-Det: Foundation Model Guided Robust Object Detection | Proposes FMG-Det to address object detection under noisy annotations | foundation model
8 | VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | Proposes VF-Eval to evaluate the ability of multimodal LLMs to generate feedback on AIGC videos | multimodal
9 | EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis | Proposes EndoBench to address the lack of evaluation of multimodal models for endoscopy analysis | large language model
10 | OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data | Proposes OmniEarth-Bench to address the evaluation of Earth's six spheres and their interactions | multimodal
11 | VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | Proposes VAU-R1 to address weak reasoning in video anomaly understanding | large language model, multimodal, chain-of-thought
12 | MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification | Proposes MCFNet to address fine-grained semantic classification in multimodal information fusion | multimodal
13 | VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | Proposes VideoReasonBench to address complex reasoning in video understanding | large language model, multimodal, chain-of-thought
14 | ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks | Proposes ThinkGeo to evaluate tool-augmented agents on remote sensing tasks | large language model, multimodal
15 | Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications | Proposes a metadata enrichment model to address insufficient metadata in cultural heritage digitization | large language model, TAMP
16 | CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection | Proposes the CMIE framework to address the shortcomings of multimodal large language models in misinformation detection | large language model, multimodal
17 | Vid-SME: Membership Inference Attacks against Large Video Understanding Models | Proposes Vid-SME to address membership inference attacks against video understanding models | large language model, multimodal
18 | DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment | Proposes DGIQA to address generalization in no-reference image quality assessment | multimodal
19 | VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL | Proposes VisualSphinx to address the shortage of training data for vision-language models | multimodal
20 | ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding | Proposes the ScaleLong benchmark to address the multi-timescale problem in long video understanding | multimodal
21 | D-AR: Diffusion via Autoregressive Models | Proposes D-AR to recast the image diffusion process as an autoregressive model | large language model
22 | ZeroSep: Separate Anything in Audio with Zero Training | Proposes ZeroSep for zero-training separation of audio sources | foundation model
23 | Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition | Proposes Uni-MuMER to address handwritten mathematical expression recognition | chain-of-thought
24 | TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models | Proposes TerraIncognita to address the challenge of insect species discovery | multimodal
25 | VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | Proposes VCapsBench to address inadequate evaluation of video caption quality | large language model

🔬 Pillar 2: RL & Architecture (10 papers)

# | Title | One-line summary | Tags | 🔗
26 | DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models | Proposes DINO-R1 to strengthen the reasoning capability of vision foundation models | reinforcement learning, open-vocabulary, open vocabulary
27 | UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning | Proposes UniRL to address the data dependence of post-training for multimodal models | reinforcement learning, large language model, multimodal
28 | VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models | Proposes VideoREPA to address physics understanding in video generation | distillation, physically plausible, foundation model
29 | Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles | Proposes a rule-based visual reinforcement learning method to address challenges in multimodal learning | reinforcement learning, large language model, multimodal
30 | UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors | Proposes UrbanCraft to address urban scene extrapolation | distillation, scene reconstruction, occupancy grid
31 | PixelThink: Towards Efficient Chain-of-Pixel Reasoning | Proposes PixelThink to address inefficient multimodal reasoning | reinforcement learning, large language model, multimodal
32 | BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning | Proposes BioCLIP 2 to improve the capabilities of biological vision models | contrastive learning, foundation model
33 | Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization | Proposes a human-preference-aligned diffusion framework to address dynamic portrait animation | direct preference optimization, spatiotemporal
34 | Grounded Reinforcement Learning for Visual Reasoning | Proposes ViGoRL to address spatial grounding in visual reasoning | reinforcement learning
35 | Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching | Proposes a model-aligned coupling method to address path crossing in flow matching | flow matching

🔬 Pillar 3: Perception & Semantics (7 papers)

# | Title | One-line summary | Tags | 🔗
36 | Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation | Proposes BriGeS to address the fusion of geometry and semantics in monocular depth estimation | depth estimation, monocular depth, foundation model
37 | AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views | Proposes AnySplat to address novel view synthesis from uncalibrated views | 3D gaussian splatting, gaussian splatting, splatting
38 | ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS | Proposes ZPressor to address the scalability of 3D Gaussian splatting models | 3D gaussian splatting, 3DGS, gaussian splatting
39 | MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | Proposes MMSI-Bench to address the evaluation of multi-image spatial intelligence | scene reconstruction, large language model, multimodal
40 | TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | Proposes TextRegion to address the weak fine-detail understanding of image-text models | open-vocabulary, open vocabulary
41 | CLDTracker: A Comprehensive Language Description for Visual Tracking | Proposes CLDTracker to address insufficient language descriptions in visual tracking | open-vocabulary, open vocabulary
42 | PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views | Proposes PhysicsNeRF to address 3D reconstruction from sparse views | NeRF, neural radiance field

🔬 Pillar 4: Generative Motion (1 paper)

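# | Title | One-line summary | Tags | 🔗
43 | Semantics-Aware Human Motion Generation from Audio Instructions | Proposes an audio-instruction-based human motion generation framework to address semantic matching | motion generation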

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
44 | To Trust Or Not To Trust Your Vision-Language Model's Prediction | Proposes TrustVLM to address the trustworthiness of vision-language model predictions | IMoS, multimodal

🔬 Pillar 1: Robot Control (1 paper)

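# | Title | One-line summary | Tags | 🔗
45 | Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features | Proposes a weakly supervised method to address the localization of manipulated image regions | manipulation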

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-line summary | Tags | 🔗
46 | VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration | Proposes VITON-DRR to address detail retention in virtual try-on | feature matching
