cs.CV (2025-09-26)

📊 62 papers in total | 🔗 14 with code

🎯 Browse by Interest Area

Pillar 9: Embodied Foundation Models (22 🔗4) · Pillar 3: Perception & Semantics (16 🔗2) · Pillar 2: RL & Architecture (15 🔗6) · Pillar 1: Robot Control (6 🔗2) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (22 papers)

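| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Explaining multimodal LLMs via intra-modal token interactions | Improves the interpretability of multimodal LLMs through intra-modal token interactions | large language model, multimodal | |
| 2 | WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM | WAVE: learns unified, versatile audio-visual embeddings with a multimodal LLM | large language model, multimodal | |
| 3 | JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation | JanusVLN: decouples semantic and spatial information with dual implicit memory to improve vision-language navigation performance | VLN, large language model, multimodal | |
| 4 | Introducing Multimodal Paradigm for Learning Sleep Staging PSG via General-Purpose Model | Proposes a new sleep-staging paradigm based on a general-purpose multimodal model, improving the accuracy and robustness of PSG analysis | multimodal | |
| 5 | Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results | Uses the GPT-4o model with optimized prompt engineering to tackle multimodal disinformation detection | multimodal | |
| 6 | MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning | Proposes MILR, which improves multimodal image generation quality through test-time latent reasoning | multimodal | |
| 7 | Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models | Proposes Geo-CoT, a perceptually grounded geospatial chain-of-thought that improves the reasoning of remote-sensing vision-language models | chain-of-thought | |
| 8 | MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models | MultiMat: multimodal program synthesis for procedural materials using large multimodal models | multimodal | |
| 9 | DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images | Proposes DeHate, a Stable Diffusion-based multimodal approach to mitigating hate speech in images | multimodal | |
| 10 | On the Status of Foundation Models for SAR Imagery | Explores foundation models for SAR imagery: self-supervised fine-tuning of DINOv2 sets a new SOTA in target recognition | foundation model | |
| 11 | DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation | DynaNav: dynamic feature and layer selection for efficient visual navigation | embodied AI, foundation model | |
| 12 | FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning | FishAI 2.0: a marine fish image classification framework incorporating multimodal few-shot learning | large language model, multimodal | |
| 13 | LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision | Proposes Labeling Copilot, a deep research agent for automated data annotation in computer vision | foundation model, multimodal | |
| 14 | UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning | Proposes the UML-CoT framework, which uses UML for structured reasoning and planning in robotic room-cleaning tasks | large language model, chain-of-thought | |
| 15 | Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation | EAGLE: a lightweight framework for explaining autoregressive token generation in multimodal LLMs | large language model, multimodal | |
| 16 | Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors | Proposes a generative-anchor-based representation editing method for VLMs that suppresses hallucinations in multimodal LLMs | large language model, multimodal | |
| 17 | Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning | Geo-R1: improves few-shot geospatial referring expression understanding through reinforcement fine-tuning | large language model, multimodal | |
| 18 | CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process | CircuitSense: proposes a circuit-system benchmark bridging visual comprehension and symbolic reasoning in engineering design | large language model | |
| 19 | A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation | Proposes EXCL, an expert cooperative learning framework for source-free unsupervised domain adaptation | multimodal | |
| 20 | From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs | Proposes BaPA, a balanced position encoding method that improves the spatial robustness of LVLMs | multimodal | |
| 21 | DiTraj: training-free trajectory control for video diffusion transformer | Proposes DiTraj, a training-free trajectory control framework for video diffusion Transformers | large language model | |
| 22 | UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models | UniVid: unifies vision tasks using pre-trained video generation models | large language model | |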

🔬 Pillar 3: Perception & Semantics (16 papers)

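| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 23 | Learning Unified Representation of 3D Gaussian Splatting | Proposes a unified representation of 3D Gaussian splatting based on continuous submanifold fields, improving neural network learning efficiency | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 24 | Polysemous Language Gaussian Splatting via Matching-based Mask Lifting | Proposes MUSplat, which achieves polysemous language Gaussian splatting through matching-based mask lifting, without scene retraining | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 25 | Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics | Proposes a lightweight structured multimodal reasoning framework for clinical scene understanding in robotics | scene understanding, multimodal, chain-of-thought | |
| 26 | Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach | Proposes an open-vocabulary, multifaceted, scalable visual emotion evaluation approach for assessing the emotion understanding of multimodal LLMs | open-vocabulary, open vocabulary, large language model | |
| 27 | Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting | Achieves vision-language alignment from image representations compressed with 2D Gaussian splatting | gaussian splatting, splatting, multimodal | |
| 28 | EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model | EfficientDepth: a fast, detail-preserving monocular depth estimation model | depth estimation, monocular depth, geometric consistency | |
| 29 | GS-2M: Gaussian Splatting for Joint Mesh Reconstruction and Material Decomposition | GS-2M: a Gaussian splatting-based method for joint mesh reconstruction and material decomposition | 3D gaussian splatting, gaussian splatting, splatting | |
| 30 | CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach | Proposes CCNeXt, an efficient self-supervised stereo depth estimation method that balances computational cost and accuracy | depth estimation, stereo depth | |
| 31 | Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding | Proposes a systematic benchmark to address the insufficient spatial understanding of vision models | scene understanding, foundation model | |
| 32 | UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective | UrbanFeel: proposes a comprehensive benchmark for urban street-scene understanding, focusing on temporal change and human perception | scene understanding, large language model, multimodal | |
| 33 | DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining | DeLiVR: efficient video deraining using a differential spatiotemporal Lie-group bias | optical flow, spatiotemporal | |
| 34 | SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference | SingRef6D: monocular 6D pose estimation of novel objects from a single RGB reference image | Depth Anything, 6D pose estimation, spatial relationship | |
| 35 | Large Material Gaussian Model for Relightable 3D Generation | Proposes the Large Material Gaussian Model for relightable 3D content generation, addressing the missing material properties in existing methods | 3D gaussian splatting, gaussian splatting, splatting | |
| 36 | Drag4D: Align Your Motion with Text-Driven 3D Scene Generation | Drag4D: proposes a text-driven 3D scene generation framework with interactive object motion control | gaussian splatting, splatting | |
| 37 | Dynamic Novel View Synthesis in High Dynamic Range | Proposes HDR-4DGS for novel view synthesis of high-dynamic-range dynamic scenes | gaussian splatting, splatting | |
| 38 | DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints | DualFocus: a depth-from-focus estimation method using spatio-focal dual variational constraints | depth estimation | |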

🔬 Pillar 2: RL & Architecture (15 papers)

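| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 39 | Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization | Proposes CapPO, which improves perception-consistent reasoning in multimodal LLMs through caption-regularized policy optimization | reinforcement learning, large language model, multimodal | |
| 40 | On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations | Proposes RobustVLA, which strengthens the robustness of vision-language-action models under multimodal perturbations | flow matching, vision-language-action, VLA | |
| 41 | Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging | Proposes a precise lung-cancer IGTV segmentation method based on transfer learning and a multimodal interaction network | Mamba, multimodal | |
| 42 | Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization | Proposes the Aes-R1 framework, based on relative-absolute policy optimization, to improve the aesthetic reasoning of multimodal LLMs | reinforcement learning, large language model, multimodal | |
| 43 | TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses | Proposes TRUST, which performs test-time refinement with uncertainty-guided SSM traversals to improve robustness under distribution shift | Mamba, SSM, state space model | |
| 44 | SPARK: Synergistic Policy And Reward Co-Evolving Framework | Proposes the SPARK framework to address the efficiency and accuracy problems of RLHF and RLVR | reinforcement learning, RLHF, large language model | |
| 45 | PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning | Proposes PSTTS, a plug-and-play module that improves the efficiency of spatio-temporal representation learning on event data | Mamba, representation learning | |
| 46 | VideoScore2: Think before You Score in Generative Video Evaluation | VideoScore2: proposes a multi-dimensional, interpretable evaluation framework for video generation, improving evaluation accuracy and controllability | reinforcement learning, chain-of-thought | |
| 47 | CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning | Proposes CapRL, which uses reinforcement learning to improve the density and quality of image captions | reinforcement learning | |
| 48 | NIFTY: a Non-Local Image Flow Matching for Texture Synthesis | NIFTY: a non-local image flow matching method for texture synthesis | flow matching | |
| 49 | Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models | Proposes a rule-based reinforcement learning method that improves the generalization of vision-language models on document image classification | reinforcement learning | |
| 50 | Joint graph entropy knowledge distillation for point cloud classification and robustness against corruptions | Proposes joint graph entropy knowledge distillation for 3D point cloud classification | distillation | |
| 51 | ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models | Proposes ERGO, which improves the efficiency of vision-language models on high-resolution image understanding through coarse-to-fine reasoning | reinforcement learning, multimodal | |
| 52 | PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data | Proposes PartSAM to address geometric understanding in 3D object segmentation | representation learning, foundation model | |
| 53 | MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning | Proposes the MIRG-RL framework, which uses reinforcement learning to improve multi-image reasoning and grounding | reinforcement learning | |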

🔬 Pillar 1: Robot Control (6 papers)

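| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 54 | Training-Free Multimodal Deepfake Detection via Graph Reasoning | Proposes the GASP-ICL framework for training-free multimodal deepfake detection | manipulation, multimodal | |
| 55 | MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation | Proposes MoWM, a mixture-of-world-models method for embodied planning that improves performance through latent-to-pixel feature modulation | manipulation, world model | |
| 56 | LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE | LongScape: proposes a long-horizon embodied world model with context-aware MoE, addressing temporal inconsistency in video generation | manipulation, world model | |
| 57 | MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning | MesaTask: proposes a task-driven tabletop scene generation framework based on 3D spatial reasoning | manipulation, DPO, physically plausible | |
| 58 | TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation | Proposes the TDEdit framework for image editing guided by text and drag interactions | manipulation | |
| 59 | DragGANSpace: Latent Space Exploration and Control for GANs | DragGANSpace: a PCA-integrated method for exploring and controlling GAN latent space | manipulation | |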

🔬 Pillar 8: Physics-based Animation (2 papers)

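| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 60 | Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs | Proposes DeeptraceReward to address fake detection in AI-generated videos | spatiotemporal, multimodal, TAMP | |
| 61 | Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm | GLARIFY: uses spatiotemporal gaze information to resolve ambiguity in visual assistant interactions | spatiotemporal, chain-of-thought | |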

🔬 Pillar 6: Video Extraction (1 paper)

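| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 62 | EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking | EgoInstruct: an egocentric video dataset and multimodal LLM benchmark for face-to-face instructional interactions | egocentric, large language model, multimodal | |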
