cs.CV (2025-08-25)

📊 34 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (12 🔗1) · Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Perception & Semantics (12 papers)

| # | Title | One-sentence summary | Tags | 🔗 |
| 1 | EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images | Proposes EndoUFM for monocular depth estimation of endoscopic images | depth estimation, monocular depth, foundation model | |
| 2 | FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses | Proposes FastAvatar for single-image 3D face reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 3 | Camera Pose Refinement via 3D Gaussian Splatting | Proposes GS-SMC to address insufficient camera pose accuracy | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 4 | GSVisLoc: Generalizable Visual Localization for Gaussian Splatting Scene Representations | Proposes GSVisLoc for localization in 3D Gaussian splatting scene representations | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 5 | ArgusCogito: Chain-of-Thought for Cross-Modal Synergy and Omnidirectional Reasoning in Camouflaged Object Segmentation | Proposes ArgusCogito to address the cognitive-depth problem in camouflaged object segmentation | scene understanding, semantic map, chain-of-thought | |
| 6 | IDU: Incremental Dynamic Update of Existing 3D Virtual Environments with New Imagery Data | Proposes an incremental dynamic update method for efficiently maintaining 3D virtual environments | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 7 | HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation | Proposes a hierarchical layout generation method for fine-grained 3D scene generation | scene understanding, physically plausible, embodied AI | |
| 8 | DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance | Proposes DoGFlow for self-supervised LiDAR scene flow estimation | scene flow | |
| 9 | Adaptive Visual Navigation Assistant in 3D RPGs | Proposes an adaptive visual navigation assistant for navigation in 3D RPGs | affordance | |
| 10 | SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization | Proposes SAIL-Recon for large-scale SfM | VGGT | |
| 11 | MESTI-MEGANet: Micro-expression Spatio-Temporal Image and Micro-expression Gradient Attention Networks for Micro-expression Recognition | Proposes MESTI-MEGANet to address challenges in micro-expression recognition | optical flow | |
| 12 | NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction | Proposes NGD for garment reconstruction from monocular video | implicit representation | |

🔬 Pillar 9: Embodied Foundation Models (11 papers)

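| # | Title | One-sentence summary | Tags | 🔗 |
| 13 | AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering | Proposes an adaptive visual anchoring strategy to address visual redundancy in multi-image question answering | large language model, multimodal | |
| 14 | UniAPO: Unified Multimodal Automated Prompt Optimization | Proposes UniAPO for multimodal automated prompt optimization | large language model, multimodal | |
| 15 | DemoBias: An Empirical Study to Trace Demographic Biases in Vision Foundation Models | Proposes DemoBias to trace demographic biases in vision foundation models | foundation model | |
| 16 | MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs | Proposes MMTok to address redundancy and inference efficiency in vision-language models | multimodal | |
| 17 | Object Detection with Multimodal Large Vision-Language Models: An In-depth Review | Reviews the applications and challenges of multimodal large vision-language models in object detection | multimodal | |
| 18 | CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation | Proposes CEIDM for entity and interaction control in text-to-image generation | large language model, chain-of-thought | |
| 19 | Instant Preference Alignment for Text-to-Image Diffusion Models | Proposes an instant preference alignment framework for text-to-image generation | large language model, multimodal | |
| 20 | Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance | Proposes a cross-modal differentiated quantization framework for assisting visually impaired users | multimodal | |
| 21 | Seeing Like a Designer Without One: A Study on Unsupervised Slide Quality Assessment via Designer Cue Augmentation | Proposes an unsupervised slide quality assessment method to improve design feedback | multimodal | |
| 22 | VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference | Proposes VISA to address inefficient inference in multimodal large language models | large language model | |
| 23 | UniSino: Physics-Driven Foundational Model for Universal CT Sinogram Standardization | Proposes UniSino for standardization in CT imaging | foundation model | |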

🔬 Pillar 2: RL & Architecture (8 papers)

| # | Title | One-sentence summary | Tags | 🔗 |
| 24 | Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images | Proposes SegEarth-OV for annotation-free open-vocabulary segmentation of remote-sensing images | distillation, open-vocabulary, open vocabulary | |
| 25 | InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | Proposes InternVL3.5 to improve the reasoning ability and efficiency of multimodal models | reinforcement learning, offline RL, multimodal | |
| 26 | Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling | Proposes a context-aware zero-shot anomaly detection framework for surveillance video | predictive model, spatiotemporal | |
| 27 | Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation | Proposes Visual-CoG to handle multi-attribute and ambiguous prompts in text-to-image generation | reinforcement learning, chain-of-thought | |
| 28 | Few-shot Human Action Anomaly Detection via a Unified Contrastive Learning Framework | Proposes a unified contrastive learning framework for few-shot human action anomaly detection | contrastive learning, foundation model | |
| 29 | Fence off Anomaly Interference: Cross-Domain Distillation for Fully Unsupervised Anomaly Detection | Proposes a cross-domain distillation framework to address anomaly interference in unsupervised anomaly detection | teacher-student, distillation | |
| 30 | F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model | Proposes F2RVLM for fine-grained fragment retrieval in multimodal long-form dialogue | reinforcement learning, multimodal | |
| 31 | HERO: Hierarchical Extrapolation and Refresh for Efficient World Models | Proposes the HERO framework to address inefficient inference in world models | world model | |

🔬 Pillar 1: Robot Control (2 papers)

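| # | Title | One-sentence summary | Tags | 🔗 |
| 32 | Why Relational Graphs Will Save the Next Generation of Vision Foundation Models? | Proposes dynamic relational graphs to improve the reasoning ability of vision foundation models | manipulation, egocentric, foundation model | |
| 33 | Propose and Rectify: A Forensics-Driven MLLM Framework for Image Manipulation Localization | Proposes the Propose-Rectify framework for image manipulation localization | manipulation, large language model, multimodal | |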

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-sentence summary | Tags | 🔗 |
| 34 | TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints | Proposes TinyGiantVLM for spatial reasoning in industrial environments | spatial relationship, multimodal | |
