cs.CV (2025-10-10)

📊 36 papers in total | 🔗 6 with code

🎯 Interest area navigation

Pillar 9: Embodied Foundation Models (17 🔗3) · Pillar 2: RL & Architecture (10 🔗1) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

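# | Title | One-line summary | Tags | 🔗
1 | Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras | Proposes a multimodal large language model approach for diagnosing shoulder disorders | large language model, multimodal
2 | PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs | PhysToolBench: the first benchmark for evaluating MLLMs' understanding of physical tools | embodied AI, vision-language-action, VLA
3 | Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects | Proposes GoBA, a physical-object backdoor attack on vision-language-action models that induces goal-oriented malicious behavior | embodied AI, vision-language-action, VLA
4 | BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception | BLINK-Twice: a visual perception reasoning benchmark that emphasizes fine-grained observation and analysis and challenges multimodal large language models | large language model, foundation model, multimodal
5 | Task-Aware Resolution Optimization for Visual Large Language Models | Proposes a task-aware resolution optimization method that improves visual large language model performance across tasks | large language model
6 | Towards Understanding Ambiguity Resolution in Multimodal Inference of Meaning | Studies how foreign-language learners reason to resolve word-sense ambiguity in multimodal contexts | multimodal
7 | Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models | Proposes a dynamic chain-of-thought method that improves vision-language models on multimodal keyphrase prediction | chain-of-thought
8 | Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation | Proposes TEMA-LLM, which uses LLM-enhanced multi-attention to address cross-domain sequential recommendation | large language model
9 | Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition | Cattle-CLIP: a multimodal learning framework for cattle behaviour recognition that improves performance in data-scarce settings | multimodal
10 | MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation | Proposes MSDM, a multimodal conditioned diffusion model that generates pathology images for cell and nuclei segmentation | multimodal
11 | Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping | Proposes AttWarp, which uses attention-guided image warping to improve multimodal large language models | large language model, multimodal
12 | CapGeo: A Caption-Assisted Approach to Geometric Reasoning | CapGeo: a caption-based approach to geometric reasoning | large language model, multimodal
13 | HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images | Proposes HandEval for evaluating hand quality in generated images, to improve AIGC applications | large language model, multimodal
14 | Hierarchical Scheduling for Multi-Vector Image Retrieval | HiMIR: a hierarchical scheduling framework for multi-vector image retrieval that improves accuracy and efficiency | large language model, multimodal
15 | Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation | Proposes a cluster-aware prompt ensemble learning framework that improves few-shot adaptation of vision-language models | zero-shot transfer
16 | On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models | Proposes a method, based on the epistemic uncertainty of visual tokens, for mitigating object hallucination in large vision-language models | large language model
17 | RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos | Proposes RO-Bench for large-scale evaluation of MLLM robustness on text-driven counterfactual videos | large language model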

🔬 Pillar 2: RL & Architecture (10 papers)

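# | Title | One-line summary | Tags | 🔗
18 | Spotlight on Token Perception for Multimodal Reinforcement Learning | Proposes VPPO, which optimizes multimodal reinforcement learning by focusing on token perception, improving LVLM reasoning | reinforcement learning, multimodal, chain-of-thought
19 | Vision Language Models: A Survey of 26K Papers | Large-scale analysis of research trends in vision-language models: a comprehensive survey based on 26K papers | distillation, gaussian splatting, splatting
20 | Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification | Proposes Minkowski-MambaNet, which uses selective state space models for forest biomass quantification | Mamba, SSM, state space model
21 | Unleashing Perception-Time Scaling to Multimodal Reasoning Models | Proposes Perception-Time Scaling (PTS), which improves the accuracy of multimodal reasoning models on visual perception tasks | reinforcement learning, multimodal
22 | MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling | Proposes MambaH-Fit, which uses state space modelling to improve the accuracy of point cloud normal estimation | Mamba, state space model
23 | Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling | Reveals human visual foraging patterns: Lévy walks in eye-tracking data and a deep predictive model | predictive model, spatiotemporal
24 | An uncertainty-aware framework for data-efficient multi-view animal pose estimation | Proposes an uncertainty-aware framework for efficient multi-view animal pose estimation when data is scarce | distillation, geometric consistency
25 | RadioFlow: Efficient Radio Map Construction Framework with Flow Matching | Proposes RadioFlow to address the inefficiency of radio map generation | flow matching
26 | Instance-Level Generation for Representation Learning | Proposes an instance-level generation method that improves representation learning for instance recognition without real images | representation learning
27 | PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning | Proposes PHyCLIP to address hierarchy and compositionality in vision-language representation learning | representation learning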

🔬 Pillar 3: Perception & Semantics (8 papers)

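# | Title | One-line summary | Tags | 🔗
28 | Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes | VAD-GS: a densification method for 3D Gaussian splatting in dynamic urban scenes, based on visibility reasoning | 3D gaussian splatting, 3DGS, gaussian splatting
29 | Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation | Proposes the Hybrid-depth framework, which uses coarse- and fine-grained feature aggregation with language guidance to improve self-supervised monocular depth estimation | depth estimation, monocular depth, foundation model
30 | Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption | Proposes oVDA, which uses caching and masking for low-memory online video depth estimation | depth estimation, Depth Anything, large language model
31 | Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding | Proposes SOC, a scalable and accurate synthetic object composition method that improves detection, segmentation, and grounding | open-vocabulary, open vocabulary, visual grounding
32 | LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates | LTGS: long-term Gaussian scene chronology modeling from sparse view updates | gaussian splatting, splatting
33 | Geometry-Aware Scene Configurations for Novel View Synthesis | Proposes a geometry-aware scene configuration method that improves novel view synthesis for indoor scenes | NeRF, neural radiance field
34 | FLOWING: Implicit Neural Flows for Structure-Preserving Morphing | FLOWING: an implicit neural flow method for structure-preserving morphing | gaussian splatting, splatting
35 | Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement | Proposes DWTA-Net, which enhances low-light video through dynamic weight-based temporal aggregation and effectively suppresses noise | optical flow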

🔬 Pillar 1: Robot Control (1 paper)

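# | Title | One-line summary | Tags | 🔗
36 | VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | Proposes VITA-VLA, which efficiently trains vision-language models to perform robot actions via action expert distillation | manipulation, distillation, VLA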
