cs.CV (2025-09-26)

📊 62 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (22 🔗4) · Pillar 3: Perception & Semantics (16 🔗2) · Pillar 2: RL & Architecture (15 🔗6) · Pillar 1: Robot Control (6 🔗2) · Pillar 8: Physics-based Animation (2) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (22 papers)

#	Title	Key takeaway	Tags	🔗
1	Explaining multimodal LLMs via intra-modal token interactions	Improves the interpretability of multimodal LLMs through intra-modal token interactions	large language model multimodal
2	WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM	WAVE: learns unified and versatile audio-visual embeddings with a multimodal LLM	large language model multimodal
3	JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation	JanusVLN: decouples semantic and spatial information with dual implicit memory to improve vision-language navigation performance	VLN large language model multimodal	✅
4	Introducing Multimodal Paradigm for Learning Sleep Staging PSG via General-Purpose Model	Proposes a new sleep-staging paradigm based on a general-purpose multimodal model, improving the accuracy and robustness of PSG analysis	multimodal
5	Effectiveness of Large Multimodal Models in Detecting Disinformation: Experimental Results	Uses GPT-4o with optimized prompt engineering to tackle multimodal disinformation detection	multimodal
6	MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning	Proposes MILR, which improves multimodal image generation quality through test-time latent reasoning	multimodal
7	Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models	Proposes Geo-CoT, a perceptually grounded geospatial chain-of-thought, to improve the reasoning of remote-sensing vision-language models	chain-of-thought
8	MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models	MultiMat: multimodal program synthesis for procedural materials using large multimodal models	multimodal
9	DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images	Proposes DeHate, a Stable Diffusion-based multimodal approach to mitigating hate speech in images	multimodal
10	On the Status of Foundation Models for SAR Imagery	Explores foundation models for SAR imagery: self-supervised fine-tuning of DINOv2 reaches a new SOTA in target recognition	foundation model
11	DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation	DynaNav: dynamic feature and layer selection for efficient visual navigation	embodied AI foundation model
12	FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning	FishAI 2.0: a marine fish image classification framework incorporating multimodal few-shot learning	large language model multimodal
13	LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision	Proposes Labeling Copilot, a deep research agent for automated data curation in computer vision	foundation model multimodal
14	UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning	Proposes the UML-CoT framework, which uses UML for structured reasoning and planning in robotic room-cleaning tasks	large language model chain-of-thought
15	Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation	EAGLE: a lightweight framework for explaining autoregressive token generation in multimodal LLMs	large language model multimodal	✅
16	Exposing Hallucinations To Suppress Them: VLMs Representation Editing With Generative Anchors	Proposes a generative-anchor-based representation editing method for VLMs to suppress hallucinations in multimodal LLMs	large language model multimodal
17	Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning	Geo-R1: improves few-shot geospatial referring expression understanding through reinforcement fine-tuning	large language model multimodal	✅
18	CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process	CircuitSense: proposes a circuit system benchmark bridging visual comprehension and symbolic reasoning in engineering design	large language model
19	A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation	Proposes EXCL, an expert cooperative learning framework for source-free unsupervised domain adaptation	multimodal
20	From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs	Proposes BaPA, a balanced positional encoding method, to improve the spatial robustness of LVLMs	multimodal
21	DiTraj: training-free trajectory control for video diffusion transformer	Proposes DiTraj, a training-free trajectory control framework for video diffusion transformers	large language model
22	UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models	UniVid: unifies vision tasks with pre-trained video generation models	large language model	✅

🔬 Pillar 3: Perception & Semantics (16 papers)

#	Title	Key takeaway	Tags	🔗
23	Learning Unified Representation of 3D Gaussian Splatting	Proposes a unified representation for 3D Gaussian splatting based on continuous submanifold fields, improving neural network learning efficiency	3D gaussian splatting 3DGS gaussian splatting
24	Polysemous Language Gaussian Splatting via Matching-based Mask Lifting	Proposes MUSplat, which achieves polysemous language Gaussian splatting through matching-based mask lifting, with no per-scene retraining	3D gaussian splatting 3DGS gaussian splatting
25	Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics	Proposes a lightweight structured multimodal reasoning framework for clinical scene understanding in robotics	scene understanding multimodal chain-of-thought
26	Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach	Proposes an open-vocabulary, multifaceted, and scalable visual emotion evaluation approach for assessing the emotion understanding of multimodal LLMs	open-vocabulary open vocabulary large language model	✅
27	Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting	Achieves vision-language alignment from compressed image representations using 2D Gaussian splatting	gaussian splatting splatting multimodal
28	EfficientDepth: A Fast and Detail-Preserving Monocular Depth Estimation Model	EfficientDepth: a fast, detail-preserving monocular depth estimation model	depth estimation monocular depth geometric consistency
29	GS-2M: Gaussian Splatting for Joint Mesh Reconstruction and Material Decomposition	GS-2M: a Gaussian splatting-based method for joint mesh reconstruction and material decomposition	3D gaussian splatting gaussian splatting splatting
30	CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach	Proposes CCNeXt, an efficient self-supervised stereo depth estimation method that balances computational cost and accuracy	depth estimation stereo depth	✅
31	Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding	Proposes a systematic benchmark to address the limited spatial understanding of vision models	scene understanding foundation model
32	UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective	UrbanFeel: proposes a comprehensive benchmark for urban street-scene understanding, focusing on temporal change and human perception	scene understanding large language model multimodal
33	DeLiVR: Differential Spatiotemporal Lie Bias for Efficient Video Deraining	DeLiVR: uses a differential spatiotemporal Lie-group bias for efficient video deraining	optical flow spatiotemporal
34	SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference	SingRef6D: monocular 6D pose estimation of novel objects from a single RGB reference image	Depth Anything 6D pose estimation spatial relationship
35	Large Material Gaussian Model for Relightable 3D Generation	Proposes the Large Material Gaussian Model for relightable 3D content generation, addressing the missing material properties of existing methods	3D gaussian splatting gaussian splatting splatting
36	Drag4D: Align Your Motion with Text-Driven 3D Scene Generation	Drag4D: proposes a text-driven 3D scene generation framework with interactive object motion control	gaussian splatting splatting
37	Dynamic Novel View Synthesis in High Dynamic Range	Proposes HDR-4DGS to address novel view synthesis for dynamic scenes in high dynamic range	gaussian splatting splatting
38	DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints	DualFocus: a depth-from-focus method using spatio-focal dual variational constraints	depth estimation

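🔬 Pillar 2: RL & Architecture (15 papers)

#	Title	Key takeaway	Tags	🔗
39	Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization	Proposes CapPO, which improves perception-consistent reasoning in multimodal LLMs through caption-regularized policy optimization	reinforcement learning large language model multimodal
40	On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations	Proposes RobustVLA to strengthen the robustness of vision-language-action models under multimodal perturbations	flow matching vision-language-action VLA
41	Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging	Proposes a precise lung cancer IGTV segmentation method based on transfer learning and a multimodal interaction network	Mamba multimodal
42	Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization	Proposes the Aes-R1 framework, based on relative-absolute policy optimization, to improve the aesthetic reasoning of multimodal LLMs	reinforcement learning large language model multimodal
43	TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses	Proposes TRUST, which performs test-time refinement with uncertainty-guided SSM traversals to improve robustness under distribution shift	Mamba SSM state space model
44	SPARK: Synergistic Policy And Reward Co-Evolving Framework	Proposes the SPARK framework to address the efficiency and accuracy problems of RLHF and RLVR	reinforcement learning RLHF large language model
45	PSTTS: A Plug-and-Play Token Selector for Efficient Event-based Spatio-temporal Representation Learning	Proposes PSTTS, a plug-and-play module that improves the efficiency of spatio-temporal representation learning on event data	Mamba representation learning
46	VideoScore2: Think before You Score in Generative Video Evaluation	VideoScore2: proposes a multi-dimensional, interpretable evaluation framework for generated video, improving evaluation accuracy and controllability	reinforcement learning chain-of-thought	✅
47	CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning	Proposes CapRL, which uses reinforcement learning to improve the density and quality of image captions	reinforcement learning	✅
48	NIFTY: a Non-Local Image Flow Matching for Texture Synthesis	NIFTY: a non-local image flow matching method for texture synthesis	flow matching	✅
49	Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models	Proposes a rule-based reinforcement learning method that improves the generalization of vision-language models on document image classification	reinforcement learning	✅
50	Joint graph entropy knowledge distillation for point cloud classification and robustness against corruptions	Proposes joint graph entropy knowledge distillation for 3D point cloud classification	distillation
51	ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models	Proposes ERGO, which improves the efficiency of vision-language models on high-resolution image understanding through coarse-to-fine reasoning	reinforcement learning multimodal	✅
52	PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data	Proposes PartSAM to address geometric understanding in 3D object segmentation	representation learning foundation model
53	MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning	Proposes the MIRG-RL framework, which uses reinforcement learning to improve multi-image reasoning and grounding	reinforcement learning	✅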

🔬 Pillar 1: Robot Control (6 papers)

#	Title	Key takeaway	Tags	🔗
54	Training-Free Multimodal Deepfake Detection via Graph Reasoning	Proposes the GASP-ICL framework for training-free multimodal deepfake detection	manipulation multimodal
55	MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation	Proposes MoWM, a mixture-of-world-models approach to embodied planning that improves performance through latent-to-pixel feature modulation	manipulation world model	✅
56	LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE	LongScape: proposes a long-horizon embodied world model with context-aware MoE, addressing temporal inconsistency in video generation	manipulation world model	✅
57	MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning	MesaTask: proposes a task-driven tabletop scene generation framework based on 3D spatial reasoning	manipulation DPO physically plausible
58	TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation	Proposes the TDEdit framework for image editing guided jointly by text and drag interactions	manipulation
59	DragGANSpace: Latent Space Exploration and Control for GANs	DragGANSpace: a PCA-integrated method for latent space exploration and control in GANs	manipulation

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	Key takeaway	Tags	🔗	⭐
60	Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs	Proposes DeeptraceReward to address the detection of fakeness in AI-generated videos	spatiotemporal multimodal TAMP
61	Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm	GLARIFY: uses spatiotemporal gaze information to resolve ambiguity in visual assistant interactions	spatiotemporal chain-of-thought

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	Key takeaway	Tags	🔗	⭐
62	EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking	EgoInstruct: an egocentric video dataset of face-to-face instructional interactions, with multimodal LLM benchmarking	egocentric large language model multimodal
