cs.CV (2025-10-19)

📊 17 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (6) · Pillar 2: RL & Architecture (6 🔗2) · Pillar 3: Perception & Semantics (4) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

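| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 1 | Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input | Proposes Res-Bench to evaluate the robustness of multimodal large language models under dynamic-resolution input | large language model, multimodal | |
| 2 | Enrich and Detect: Video Temporal Grounding with Multimodal LLMs | Proposes ED-VTG, which uses multimodal LLMs for fine-grained video temporal grounding | large language model, multimodal | |
| 3 | Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs | LENS: a plug-and-play segmentation capability for frozen multimodal LLMs | large language model, multimodal | |
| 4 | Training-free Online Video Step Grounding | Proposes BaGLM, which uses the zero-shot capabilities of large models for online video step grounding and outperforms offline trained methods | large language model, multimodal | |
| 5 | Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding | Reveals brain-like hierarchical patterns in vision-language models through fMRI-based neural encoding | multimodal | |
| 6 | EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction | Proposes EventFormer for action-centric video event prediction and builds the large-scale AVEP dataset | multimodal | |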

🔬 Pillar 2: RL & Architecture (6 papers)

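| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 7 | Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis | A survey of foundation models in medical imaging that systematically summarizes architectures, training paradigms, and clinical applications | distillation, foundation model, multimodal | |
| 8 | A Comprehensive Survey on World Models for Embodied AI | A comprehensive survey of world models for embodied AI, covering three dimensions: functionality, temporal modeling, and spatial representation | world model, embodied AI | |
| 9 | EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation | EMRRG: efficient fine-tuning of pre-trained X-ray Mamba networks for radiology report generation | Mamba, SSM, large language model | |
| 10 | Video Reasoning without Training | Proposes V-Reason, which improves the performance of large models on video reasoning tasks without any training | reinforcement learning, multimodal, chain-of-thought | |
| 11 | Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback | Uniworld-V2: improves image editing with diffusion negative-aware finetuning and MLLM implicit feedback | flow matching, large language model, multimodal | |
| 12 | Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding | Proposes the W2R2 framework to address the 2D semantic bias of video LLMs in 3D grounding | representation learning, multimodal | |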

🔬 Pillar 3: Perception & Semantics (4 papers)

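| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 13 | SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes | SceneCOT: proposes a grounded chain-of-thought reasoning framework for 3D scenes that improves embodied question answering performance | scene understanding, large language model, multimodal | |
| 14 | 2DGS-R: Revisiting the Normal Consistency Regularization in 2D Gaussian Splatting | 2DGS-R: improves the rendering quality and geometric accuracy of 2D Gaussian splatting through hierarchical training and in-place cloning | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 15 | GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation | GS2POSE: a 6D object pose estimation method built on Gaussian splatting | 3DGS, gaussian splatting, splatting | |
| 16 | How Universal Are SAM2 Features? | Quantifies the gap in feature generalization between general-purpose vision models and segmentation-specialized models | depth estimation | |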

🔬 Pillar 8: Physics-based Animation (1 paper)

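| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 17 | HumanCM: One Step Human Motion Prediction | Proposes HumanCM, a consistency-model-based framework for one-step human motion prediction | spatiotemporal | |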
