cs.CV (2025-06-11)

📊 40 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗4) · Pillar 3: Perception & Semantics (9 🔗3) · Pillar 2: RL & Architecture (9 🔗2) · Pillar 1: Robot Control (5) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 5: Interaction & Reaction (1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

# | Title | One-line summary | Tags | 🔗
1 | Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search | Proposes the AutoCaption framework to address the evaluation of video captioning | large language model, multimodal
2 | EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models | Proposes EfficientVLA to accelerate and compress VLA models | vision-language-action, VLA
3 | Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy | Proposes Kvasir-VQA-x1 to address the shortage of medical visual question answering datasets | large language model, multimodal
4 | OctoNav: Towards Generalist Embodied Navigation | Proposes OctoNav to unify multimodal navigation tasks | embodied AI, VLA, VLN
5 | AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation | Proposes AnimateAnyMesh to generate high-quality animations of 3D models | foundation model
6 | Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets | Proposes a class-similarity-based multimodal classification method to handle heterogeneous category sets | multimodal
7 | Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation | Proposes CCELLA to address the scarcity of medical imaging data | large language model, foundation model
8 | HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding | Proposes HSENet to address language-vision fusion in 3D medical image understanding | large language model, multimodal
9 | Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding | Proposes ReVisiT to address the underuse of visual information in LVLM decoding | multimodal, visual grounding
10 | Digitization of Document and Information Extraction using OCR | Proposes a framework combining OCR with large language models to improve the accuracy of document information extraction | large language model
11 | DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision | Proposes DreamCS to address geometric bias in text-to-3D generation | large language model
12 | Q-SAM2: Accurate Quantization for Segment Anything Model 2 | Proposes Q-SAM2 to address quantization of SAM2 for resource-constrained devices | foundation model
13 | LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs | Proposes LLM-to-Phy3D to address 3D object generation under physical constraints | large language model

🔬 Pillar 3: Perception & Semantics (9 papers)

# | Title | One-line summary | Tags | 🔗
14 | DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction | Proposes DynaSplat to address dynamic scene reconstruction | gaussian splatting, splatting, scene reconstruction
15 | HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene | Proposes HAIF-GS to address consistency in dynamic scene reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting
16 | Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation | Proposes the Vireo framework to address open-vocabulary domain-generalized semantic segmentation | open-vocabulary, open vocabulary, foundation model
17 | Accurate and efficient zero-shot 6D pose estimation with frozen foundation models | Proposes FreeZeV2 to address zero-shot 6D pose estimation | 6D pose estimation, foundation model
18 | Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation | Proposes the DeGSS framework to address the modeling of multi-part articulated objects | gaussian splatting, splatting
19 | Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS | Proposes a global Gaussian mixture reduction method to address the memory cost of 3D Gaussian splatting rendering | 3D gaussian splatting, 3DGS, gaussian splatting
20 | MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images | Proposes MetricHMSR to recover human pose and scene from monocular images | depth estimation, metric depth
21 | The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge | Proposes a novel view synthesis method for sparse, unposed images | 3DGS, NeRF
22 | Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes | Proposes a method to predict the sounds of hand interactions in 3D scenes | scene reconstruction

🔬 Pillar 2: RL & Architecture (9 papers)

# | Title | One-line summary | Tags | 🔗
23 | UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting | Proposes UniPre3D to address unified representation learning for 3D point clouds | representation learning, gaussian splatting, splatting
24 | Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation | Proposes a visual perturbation framework to improve multimodal reasoning | DPO, large language model, multimodal
25 | 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation | Proposes a geometric distillation method to improve the 3D understanding of vision-language models | distillation, VGGT, foundation model
26 | SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields | Proposes SemanticSplat to address semantic and geometric modeling in 3D scene understanding | distillation, scene understanding, open-vocabulary
27 | Towards a general-purpose foundation model for fMRI analysis | Proposes NeuroSTORM to address the reproducibility and transferability of fMRI analysis | Mamba, foundation model
28 | ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs | Proposes ViCrit to address visual perception in vision-language models | reinforcement learning, large language model
29 | PlayerOne: Egocentric World Simulator | Proposes PlayerOne to address the challenges of real-world simulation | world model, egocentric
30 | Synthetic Geology: Structural Geology Meets Deep Learning | Proposes StructuralGeo to address data scarcity in geological reconstruction | flow matching, foundation model
31 | MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion | Proposes the MMME dataset to support multimodal micro-expression analysis | MAE, multimodal

🔬 Pillar 1: Robot Control (5 papers)

# | Title | One-line summary | Tags | 🔗
32 | Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | Proposes using drawing to strengthen the spatial reasoning of vision-language models | manipulation, reinforcement learning, spatial relationship
33 | CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings | Proposes the CHIP dataset to address 6D pose estimation of chairs in industrial environments | manipulation, 6D pose estimation
34 | Benchmarking Gaslighting Negation Attacks Against Reasoning Models | Proposes GaslightingBench-R to evaluate how well reasoning models resist negation attacks | manipulation, multimodal, chain-of-thought
35 | VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models | Proposes VITA to address zero-shot value functions for vision-language models | manipulation, reinforcement learning, offline reinforcement learning
36 | CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation | Proposes the CheckManual benchmark to address the challenge of manual-based appliance manipulation | manipulation

🔬 Pillar 8: Physics-based Animation (2 papers)

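# | Title | One-line summary | Tags | 🔗
37 | LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning | Proposes a mask-aware LoRA fine-tuning method for flexible video editing | spatiotemporal
38 | MPFNet: A Multi-Prior Fusion Network with a Progressive Training Strategy for Micro-Expression Recognition | Proposes MPFNet to address multi-source information fusion in micro-expression recognition | spatiotemporal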

🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | One-line summary | Tags | 🔗
39 | InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions | Proposes the InterActHuman framework to address multi-concept human animation | human-object interaction, spatiotemporal

🔬 Pillar 6: Video Extraction (1 paper)

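# | Title | One-line summary | Tags | 🔗
40 | A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | Proposes a minimal-video-pair benchmark to address physical understanding in video-language models | egocentric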
