cs.CV (2025-05-29)

📊 46 papers total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (25 🔗10) | Pillar 2: RL & Architecture (10 🔗5) | Pillar 3: Perception & Semantics (7 🔗3) | Pillar 4: Generative Motion (1) | Pillar 5: Interaction & Reaction (1 🔗1) | Pillar 1: Robot Control (1) | Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (25 papers)

#	Title	One-Sentence Summary	Tags	🔗
1	Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models	Proposes Impromptu VLA to address the challenges facing vision-language-action models in autonomous driving	vision-language-action VLA	✅
2	Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought	Proposes Argus to address insufficient attention in visual reasoning	large language model multimodal chain-of-thought	✅
3	Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model	Proposes a preemptive hallucination-reduction method to address hallucination in multimodal language models	large language model multimodal
4	OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation	Proposes OpenUni to unify multimodal understanding and generation	large language model multimodal	✅
5	MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking	Proposes MaskAdapt to address unsupervised domain adaptation in agriculture	multimodal
6	Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence	Proposes Spatial-MLLM to address visual-based spatial intelligence	large language model foundation model multimodal	✅
7	FMG-Det: Foundation Model Guided Robust Object Detection	Proposes FMG-Det to address object detection under noisy annotations	foundation model
8	VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos	Proposes VF-Eval to evaluate the ability of multimodal LLMs to generate feedback on AIGC videos	multimodal
9	EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis	Proposes EndoBench to address the inadequate evaluation of multimodal models for endoscopy analysis	large language model
10	OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data	Proposes OmniEarth-Bench to address evaluation across Earth's six spheres and their interactions	multimodal
11	VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning	Proposes VAU-R1 to address insufficient reasoning ability in video anomaly understanding	large language model multimodal chain-of-thought	✅
12	MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification	Proposes MCFNet to address fine-grained semantic classification in multimodal information fusion	multimodal
13	VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?	Proposes VideoReasonBench to address complex reasoning in video understanding	large language model multimodal chain-of-thought
14	ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks	Proposes ThinkGeo to evaluate tool-augmented agents on remote sensing tasks	large language model multimodal
15	Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications	Proposes a metadata enrichment model to address insufficient metadata in cultural heritage digitization	large language model TAMP
16	CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection	Proposes the CMIE framework to address the shortcomings of multimodal large language models in misinformation detection	large language model multimodal
17	Vid-SME: Membership Inference Attacks against Large Video Understanding Models	Proposes Vid-SME to address membership inference attacks on video understanding models	large language model multimodal
18	DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment	Proposes DGIQA to address generalization in no-reference image quality assessment	multimodal
19	VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL	Proposes VisualSphinx to address the shortage of training data for vision-language models	multimodal
20	ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding	Proposes the ScaleLong benchmark to address the multi-timescale problem in long video understanding	multimodal	✅
21	D-AR: Diffusion via Autoregressive Models	Proposes D-AR to recast the image diffusion process as an autoregressive model	large language model	✅
22	ZeroSep: Separate Anything in Audio with Zero Training	Proposes ZeroSep to separate audio sources with zero training	foundation model
23	Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition	Proposes Uni-MuMER to address handwritten mathematical expression recognition	chain-of-thought	✅
24	TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models	Proposes TerraIncognita to address the challenge of insect species discovery	multimodal	✅
25	VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation	Proposes VCapsBench to address the inadequate evaluation of video caption quality	large language model	✅

🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	One-Sentence Summary	Tags	🔗
26	DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models	Proposes DINO-R1 to strengthen the reasoning ability of vision foundation models	reinforcement learning open-vocabulary open vocabulary
27	UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning	Proposes UniRL to address the data dependence of multimodal model post-training	reinforcement learning large language model multimodal	✅
28	VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models	Proposes VideoREPA to address physics understanding in video generation	distillation physically plausible foundation model	✅
29	Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles	Proposes a rule-based visual reinforcement learning method to address challenges in multimodal learning	reinforcement learning large language model multimodal	✅
30	UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors	Proposes UrbanCraft to address urban scene extrapolation	distillation scene reconstruction occupancy grid
31	PixelThink: Towards Efficient Chain-of-Pixel Reasoning	Proposes PixelThink to address inefficient multimodal reasoning	reinforcement learning large language model multimodal
32	BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning	Proposes BioCLIP 2 to improve the capabilities of biological vision models	contrastive learning foundation model
33	Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization	Proposes a human-preference-aligned diffusion framework to address dynamic portrait animation	direct preference optimization spatiotemporal	✅
34	Grounded Reinforcement Learning for Visual Reasoning	Proposes ViGoRL to address spatial grounding in visual reasoning	reinforcement learning
35	Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching	Proposes a model-aligned coupling method to address path crossing in flow matching	flow matching	✅

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-Sentence Summary	Tags	🔗
36	Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation	Proposes BriGeS to address the fusion of geometry and semantics in monocular depth estimation	depth estimation monocular depth foundation model
37	AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views	Proposes AnySplat to address novel view synthesis from uncalibrated views	3D gaussian splatting gaussian splatting splatting	✅
38	ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS	Proposes ZPressor to address the scalability of 3D Gaussian splatting models	3D gaussian splatting 3DGS gaussian splatting
39	MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence	Proposes MMSI-Bench to address the evaluation of multi-image spatial intelligence	scene reconstruction large language model multimodal
40	TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models	Proposes TextRegion to address the weakness of image-text models in understanding fine details	open-vocabulary open vocabulary	✅
41	CLDTracker: A Comprehensive Language Description for Visual Tracking	Proposes CLDTracker to address insufficient language descriptions in visual tracking	open-vocabulary open vocabulary	✅
42	PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views	Proposes PhysicsNeRF to address 3D reconstruction from sparse views	NeRF neural radiance field

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
43	Semantics-Aware Human Motion Generation from Audio Instructions	Proposes an audio-instruction-based human motion generation framework to address semantic matching	motion generation

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
44	To Trust Or Not To Trust Your Vision-Language Model's Prediction	Proposes TrustVLM to address the trustworthiness of vision-language model predictions	IMoS multimodal	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
45	Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features	Proposes a weakly supervised method to address the localization of manipulated image regions	manipulation

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-Sentence Summary	Tags	🔗
46	VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration	Proposes VITON-DRR to address detail retention in virtual try-on	feature matching
