cs.CV (2025-10-10)

📊 36 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗3) · Pillar 2: RL & Architecture (10 🔗1) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

#	Title	Key Point	Tags	🔗	⭐
1	Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras	Proposes using multimodal large language models to diagnose shoulder disorders	large language model multimodal
2	PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs	PhysToolBench: the first benchmark for evaluating MLLMs' understanding of physical tools	embodied AI vision-language-action VLA
3	Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects	Proposes GoBA, a physical-object backdoor attack on vision-language-action models that triggers goal-oriented malicious behavior	embodied AI vision-language-action VLA	✅
4	BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception	BLINK-Twice: a visual perception reasoning benchmark that emphasizes fine-grained observation and analysis and challenges multimodal large language models	large language model foundation model multimodal	✅
5	Task-Aware Resolution Optimization for Visual Large Language Models	Proposes a task-aware resolution optimization method that improves visual large language models across different tasks	large language model
6	Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning	Studies how foreign-language learners reason to resolve word-sense ambiguity in multimodal contexts	multimodal
7	Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models	Proposes a dynamic chain-of-thought method that improves vision-language models on multimodal keyphrase prediction	chain-of-thought	✅
8	Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation	Proposes TEMA-LLM, which uses LLM-enhanced multi-attention for cross-domain sequential recommendation	large language model
9	Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition	Cattle-CLIP: a multimodal learning framework for cattle behaviour recognition that improves performance in data-scarce settings	multimodal
10	MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation	Proposes MSDM, a multimodal conditioned diffusion model that generates pathology images for cell and nuclei segmentation	multimodal
11	Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping	Proposes AttWarp, which uses attention-guided image warping to improve multimodal large language models	large language model multimodal
12	CapGeo: A Caption-Assisted Approach to Geometric Reasoning	CapGeo: a caption-assisted approach to geometric reasoning	large language model multimodal
13	HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images	Proposes HandEval for evaluating hand quality in generated images, improving AIGC applications	large language model multimodal
14	Hierarchical Scheduling for Multi-Vector Image Retrieval	HiMIR: a hierarchical scheduling framework for multi-vector image retrieval that improves accuracy and efficiency	large language model multimodal
15	Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation	Proposes a cluster-aware prompt ensemble learning framework that improves few-shot adaptation of vision-language models	zero-shot transfer
16	On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models	Proposes a method that mitigates object hallucination in large vision-language models based on the epistemic uncertainty of visual tokens	large language model
17	RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos	Proposes RO-Bench for large-scale evaluation of MLLM robustness on text-driven counterfactual videos	large language model

🔬 Pillar 2: RL & Architecture (10 papers)

#	Title	Key Point	Tags	🔗	⭐
18	Spotlight on Token Perception for Multimodal Reinforcement Learning	Proposes VPPO, which optimizes multimodal reinforcement learning by focusing on token perception, improving LVLM reasoning	reinforcement learning multimodal chain-of-thought
19	Vision Language Models: A Survey of 26K Papers	Large-scale analysis of research trends in vision-language models: a comprehensive survey of 26K papers	distillation gaussian splatting splatting
20	Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification	Proposes Minkowski-MambaNet, which uses selective state space models for forest biomass quantification	Mamba SSM state space model
21	Unleashing Perception-Time Scaling to Multimodal Reasoning Models	Proposes Perception-Time Scaling (PTS), which improves the accuracy of multimodal reasoning models on visual perception tasks	reinforcement learning multimodal
22	MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling	Proposes MambaH-Fit, which uses state space models to improve the accuracy of point cloud normal estimation	Mamba state space model
23	Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling	Reveals human visual foraging patterns: Lévy walks in eye-tracking data and a deep predictive model	predictive model spatiotemporal
24	An uncertainty-aware framework for data-efficient multi-view animal pose estimation	Proposes an uncertainty-aware framework for efficient multi-view animal pose estimation under data scarcity	distillation geometric consistency
25	RadioFlow: Efficient Radio Map Construction Framework with Flow Matching	Proposes RadioFlow to address inefficient radio map generation	flow matching	✅
26	Instance-Level Generation for Representation Learning	Proposes an instance-level generation method that improves representation learning for instance recognition without real images	representation learning
27	PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning	Proposes PHyCLIP to address hierarchy and compositionality in vision-language representation learning	representation learning

🔬 Pillar 3: Perception & Semantics (8 papers)

#	Title	Key Point	Tags	🔗	⭐
28	Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes	VAD-GS: a densification method for 3D Gaussian Splatting in dynamic urban scenes, based on visibility reasoning	3D gaussian splatting 3DGS gaussian splatting
29	Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation	Proposes the Hybrid-depth framework, which uses coarse- and fine-grained feature fusion with language guidance to improve self-supervised monocular depth estimation	depth estimation monocular depth foundation model	✅
30	Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption	Proposes oVDA, which uses caching and masking to achieve low-memory online video depth estimation	depth estimation Depth Anything large language model
31	Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding	Proposes SOC, a scalable and accurate synthetic object composition method for improving detection, segmentation, and grounding	open-vocabulary open vocabulary visual grounding
32	LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates	LTGS: long-term Gaussian scene chronology modeling from sparse view updates	gaussian splatting splatting
33	Geometry-Aware Scene Configurations for Novel View Synthesis	Proposes geometry-aware scene configurations that improve novel view synthesis for indoor scenes	NeRF neural radiance field
34	FLOWING: Implicit Neural Flows for Structure-Preserving Morphing	FLOWING: an implicit neural flow method for structure-preserving morphing	gaussian splatting splatting	✅
35	Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement	Proposes DWTA-Net, which enhances low-light video through dynamic weight-based temporal aggregation and effectively suppresses noise	optical flow

🔬 Pillar 1: Robot Control (1 paper)

#	Title	Key Point	Tags	🔗	⭐
36	VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation	Proposes VITA-VLA, which efficiently trains vision-language models to perform robot actions via action expert distillation	manipulation distillation VLA
